Python Regexp Not Match Sequence
Solution 1:
You are trying to match any text except the 2-char sequence \) with [^\\\)]+, which is wrong, because [^...] is a negated character class: it matches a single character that is not in the set or ranges defined in the class. It can never match character combinations; the * or + quantifiers just repeat that single-character match.
What you have in mind is called a tempered greedy token: (?:(?!\\\)).)* or (?:(?!\\\)).)*?.
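To make the difference concrete, here is a minimal sketch that runs both patterns against the sample string from this answer. The variable names `char_class` and `tempered` are only illustrative.

```python
import re

search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

# Negated class: each repetition rejects a lone "\" or a lone ")",
# so it cannot get past the "\" in "\ce" right after the opening "\(".
char_class = re.compile(r"\\\([^\\\)]+\\\)")

# Tempered greedy token: each "." is consumed only if the two-char
# sequence "\)" does not start at the current position.
tempered = re.compile(r"\\\((?:(?!\\\)).)*\\\)")

class_matches = char_class.findall(search_str)
tempered_matches = tempered.findall(search_str)

print(class_matches)
print(tempered_matches)
```

The negated class finds nothing here, while the tempered token returns both `\(...\)` chunks.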
However, the tempered greedy token is not the best practice in this case. See the rexegg.com note on when not to use TGT:
For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.
The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.
Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.
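The quoted numbers come from pcretest and PCRE, not from Python's re module. If you want to check how the two approaches compare on your own interpreter, a rough sketch with timeit could look like this. The run count of 10,000 is an arbitrary choice, and the timings will vary by machine and Python version.

```python
import re
import timeit

subject = "{START} Mary {END}"

# Lazy dot-star version.
lazy = re.compile(r"\{START\}.*?\{END\}")
# Tempered greedy token version.
tempered = re.compile(r"\{START\}(?:(?!\{END\}).)*\{END\}")

# Both patterns should match the same text before we compare speed.
assert lazy.search(subject).group(0) == tempered.search(subject).group(0)

runs = 10_000  # arbitrary; raise it for steadier numbers
lazy_time = timeit.timeit(lambda: lazy.search(subject), number=runs)
tempered_time = timeit.timeit(lambda: tempered.search(subject), number=runs)

print(f"lazy:     {lazy_time:.4f} s for {runs} runs")
print(f"tempered: {tempered_time:.4f} s for {runs} runs")
```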
Your strings seem to be well-formed and rather short, so use a plain lazy dot pattern, that is, the \\\(.*?\\\) regex.
Besides, you need to use the r prefix (a raw string literal) in the replacement pattern definition, or \1 will be parsed as an octal escape, that is, the character \x01 (start of heading).
import re
search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
print(search_str)
out = re.sub(r'(\\\(.*?\\\))', r'<span>\1</span>', search_str)
print(out)
See the Python demo
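The effect of the r prefix on the replacement string can be seen in a small side-by-side sketch. The short input `\(x\)` is a made-up sample, not taken from the question.

```python
import re

text = r"\(x\)"  # made-up sample input
pattern = r"(\\\(.*?\\\))"

# Raw replacement: re.sub sees a backslash followed by 1 and
# treats it as a backreference to group 1.
raw_result = re.sub(pattern, r"<span>\1</span>", text)

# Non-raw replacement: Python turns "\1" into the single
# character chr(1) before re.sub ever sees the string.
plain_result = re.sub(pattern, "<span>\1</span>", text)

print(repr(raw_result))
print(repr(plain_result))
```

Printing with repr makes the stray control character visible in the second result.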
Solution 2:
I think that [^\\][^)] should do the trick, or nearly so. That will match any two characters as long as the first isn't a backslash and the second isn't a closing paren. You could experiment with some grouping, too, if that's not exactly what you want.
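If you experiment with this pattern, it helps to see which pairs it accepts and rejects. The sketch below uses made-up sample strings; note that the two classes are tested independently, so some pairs other than \) are rejected as well.

```python
import re

two_classes = re.compile(r"[^\\][^)]")

# Made-up sample: a, b, backslash, ")", c, d
sample = r"ab\)cd"
pairs = two_classes.findall(sample)

# Pairs that are not "\)" but still fail one of the two classes.
backslash_first = two_classes.fullmatch(r"\c")  # first char is a backslash
paren_second = two_classes.fullmatch("a)")      # second char is ")"

print(pairs)
print(backslash_first, paren_second)
```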
Solution 3:
Thanks to Sebastian's recommendation I used a Tempered Greedy Token:
(\\\((?:(?!\\\)).)*\\\))
Simply awesome :-)