
Python Regexp Not Match Sequence

I need to wrap a MathJax string in an HTML tag. I wonder how to exclude \) from the search pattern so that it does not match the full string. With a single char it's easy, e.g. [^)], but what to do when I n

Solution 1:

You are trying to match any text except the 2-character sequence \) with [^\\\)]+, which is wrong: [^...] is a negated character class, and it matches a single character that is not in the set or range of chars defined in the class. It can never match character combinations; the * and + quantifiers just repeat that single-character match.
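To make the point above concrete, here is a minimal sketch of what the negated class really does. The sample string is made up for illustration; it is not the one from the question.

```python
import re

# Hypothetical sample: the second formula contains a lone ")" that is
# not part of the closing "\)" delimiter.
text = r"\(a+b\) and \(c(d)\)"

# [^\\\)]+ matches runs of single characters that are neither a
# backslash nor a closing parenthesis. It knows nothing about "\)"
# as a two-character unit.
runs = re.findall(r"[^\\\)]+", text)
print(runs)

# Used between the delimiters, the class stops at the lone ")" inside
# the second formula, so that formula is not matched at all.
wrapped = re.findall(r"\\\([^\\\)]+\\\)", text)
print(wrapped)
```

The second formula is missed because any ")" or backslash ends the run, whether or not they appear together.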

What you are thinking of is called a tempered greedy token, (?:(?!\\\)).)* or, in its lazy form, (?:(?!\\\)).)*?.
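As a rough illustration of how the tempered greedy token behaves, the sketch below runs it between the \( and \) delimiters on a made-up string (an assumption, not the asker's data). The lookahead (?!\\\)) is checked before every single character the dot consumes.

```python
import re

# Hypothetical sample with a lone ")" inside the second formula.
text = r"\(a+b\) and \(c(d)\)"

# (?:(?!\\\)).)* consumes one character at a time, but only while the
# next two characters are not the closing delimiter "\)".
tgt = re.compile(r"\\\((?:(?!\\\)).)*\\\)")

matches = tgt.findall(text)
print(matches)
```

Unlike a negated character class, this pattern lets a lone ")" through and stops only at the full \) sequence.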

However, the tempered greedy token is not the best practice in this case. See the rexegg.com note on when not to use TGT:

For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.

The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.

Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.

Your strings seem to be well-formed and rather short, so a plain lazy dot matching pattern is enough, that is, the \\\(.*?\\\) regex.
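A small sketch comparing the two approaches on a made-up, well-formed string: both should return the same matches. The timing loop is only a rough check with Python's own re engine. The figures quoted above are for PCRE, and the numbers on your machine will differ.

```python
import re
import timeit

# Hypothetical well-formed sample, not the asker's data.
text = r"\(a+b\) and \(c(d)\)"

lazy = re.compile(r"\\\(.*?\\\)")                   # lazy dot matching
tempered = re.compile(r"\\\((?:(?!\\\)).)*\\\)")    # tempered greedy token

# On well-formed input both patterns find the same substrings.
same = lazy.findall(text) == tempered.findall(text)
print(same)

# Rough timing only; results depend on the engine and the machine.
for name, pat in (("lazy", lazy), ("tempered", tempered)):
    seconds = timeit.timeit(lambda: pat.findall(text), number=10_000)
    print(f"{name}: {seconds:.4f} s for 10,000 runs")
```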

Besides, you need to use the r prefix (a raw string literal) in the replacement pattern definition, or \1 will be parsed as an octal escape (the character \x01, start of heading) instead of a backreference to group 1.

import re
search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
print(search_str)
out = re.sub(r'(\\\(.*?\\\))', r'<span>\1</span>', search_str)
print(out)

See the Python demo
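To see why the r prefix on the replacement matters, here is a minimal sketch on a made-up string. Without the prefix, Python turns \1 into the control character \x01 before re.sub ever sees it.

```python
import re

# Hypothetical short sample.
text = r"\(x\) and \(y\)"
pattern = r"(\\\(.*?\\\))"

# In a regular string literal, "\1" is an octal escape for chr(1).
print("\1" == "\x01")

# Raw replacement: \1 reaches re.sub as a backreference to group 1.
raw_out = re.sub(pattern, r"<span>\1</span>", text)
print(raw_out)

# Non-raw replacement: the match is replaced by the \x01 character.
plain_out = re.sub(pattern, "<span>\1</span>", text)
print(repr(plain_out))
```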

Solution 2:

I think that [^\\][^)] should do the trick, or nearly so. That will match any two characters as long as the first isn't a backslash and the second isn't a closing paren. You could experiment with some grouping, too, if that's not exactly what you want.
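If you want to experiment with this suggestion, a quick check like the one below shows what the two-class pattern picks up. The sample strings are made up. Note that the two classes are tested independently, so a ")" is refused in the second position even when no backslash precedes it.

```python
import re

# Hypothetical samples.
formula = r"\(a+b\)"
plain = "a)"

# Each match is exactly two characters: one that is not a backslash,
# followed by one that is not ")".
pairs = re.findall(r"[^\\][^)]", formula)
print(pairs)

# "a)" contains no "\)" sequence, yet nothing matches, because ")"
# is never allowed as the second character.
no_match = re.findall(r"[^\\][^)]", plain)
print(no_match)
```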

Solution 3:

Thanks to Sebastian's recommendation I used a Tempered Greedy Token:

(\\\((?:(?!\\\)).)*\\\))

simply awesome :-)
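For completeness, a sketch of how a tempered greedy token pattern with a capturing group could be plugged into re.sub, reusing the sample string from Solution 1. The pattern here has its outer capturing group closed so that \1 refers to the whole formula. The variable names are only illustrative.

```python
import re

formula = r"\(\ce{\sigma_{s}^{b}(H2O)}\)"
search_str = formula + " bla bla " + formula

# Group 1 captures the whole "\( ... \)" span; the tempered dot stops
# only in front of the closing "\)" delimiter.
pattern = re.compile(r"(\\\((?:(?!\\\)).)*\\\))")

# Raw replacement string so that \1 is a backreference.
out = pattern.sub(r"<span>\1</span>", search_str)
print(out)
```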
