
Accommodate Two Types Of Quotes In A Regex

I am using a regex to replace quotes in an input string. My data contains two 'types' of quotes - ' and “. There's a very subtle difference between the two. Currently, I a

Solution 1:

I don't think Python's built-in re module has a "quotation marks" character class, so you'll have to do the matching yourself.

You could keep a list of common quotation mark Unicode characters (here's a list for a good start) and build the part of the regex that matches quotation marks programmatically.
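As a minimal sketch of that idea, the snippet below builds a character class from a short list of quotation marks. The list contents and the names QUOTE_CHARS, QUOTE_RE and strip_quotes are assumptions for illustration, not part of the original answer; extend the list to whatever your data contains.

```python
import re

# Assumed starter list of quotation-mark characters; extend as needed.
QUOTE_CHARS = [
    "\u0022",  # QUOTATION MARK
    "\u0027",  # APOSTROPHE
    "\u2018",  # LEFT SINGLE QUOTATION MARK
    "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "\u201C",  # LEFT DOUBLE QUOTATION MARK
    "\u201D",  # RIGHT DOUBLE QUOTATION MARK
]

# re.escape protects any character that would be special inside a class.
QUOTE_RE = re.compile("[" + "".join(re.escape(c) for c in QUOTE_CHARS) + "]")


def strip_quotes(text):
    """Remove every character listed in QUOTE_CHARS from text."""
    return QUOTE_RE.sub("", text)
```

Keeping the characters in a list rather than hard-coding the pattern means a newly discovered quote variant only needs one extra entry.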

Solution 2:

I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.

How many different types of quotes exist?

29, according to Unicode, see below.

The Unicode standard provides a definitive text file of Unicode properties, PropList.txt, which includes a list of quotation marks. Since Python's built-in re module does not support Unicode properties in regular expressions, you cannot currently use \p{QuotationMark} there. However, it's trivial to create a regular expression character class:

// placed on multiple lines for readability, remove spaces
// and then place in your regex in place of the current quotes
[\u0022   \u0027    \u00AB    \u00BB
\u2018    \u2019    \u201A    \u201B
\u201C    \u201D    \u201E    \u201F
\u2039    \u203A    \u300C    \u300D
\u300E    \u300F    \u301D    \u301E
\u301F    \uFE41    \uFE42    \uFE43
\uFE44    \uFF02    \uFF07    \uFF62
\uFF63]
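Here is one way the class above might be used with Python's re module, assuming Python 3, where \u escapes in ordinary string literals produce the actual characters. The names QUOTATION_MARKS, QUOTE_CLASS and normalize_quotes are made up for this sketch, and replacing every match with a plain ASCII double quote is just one possible choice.

```python
import re

# The 29 code points from the list above, with the spaces removed.
QUOTATION_MARKS = (
    "\u0022\u0027\u00AB\u00BB"
    "\u2018\u2019\u201A\u201B"
    "\u201C\u201D\u201E\u201F"
    "\u2039\u203A\u300C\u300D"
    "\u300E\u300F\u301D\u301E"
    "\u301F\uFE41\uFE42\uFE43"
    "\uFE44\uFF02\uFF07\uFF62"
    "\uFF63"
)

# None of these characters is special inside [...], so no escaping is needed.
QUOTE_CLASS = re.compile("[" + QUOTATION_MARKS + "]")


def normalize_quotes(text, replacement='"'):
    """Replace any listed quotation mark with a single replacement string."""
    return QUOTE_CLASS.sub(replacement, text)
```

Pass replacement="" instead if the goal is to delete the quotes rather than normalize them.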

As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
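A rough sketch of that route, assuming the third-party regex package is installed (pip install regex). The property is spelled here with its Unicode name, Quotation_Mark. The fallback character class and the name drop_quotes are illustrative assumptions so the snippet still runs without the package; the fallback covers only a handful of quotes.

```python
import re

try:
    # Third-party package by Matthew Barnett; not part of the standard library.
    import regex

    QUOTES = regex.compile(r"\p{Quotation_Mark}")
except ImportError:
    # Assumed fallback: a small explicit subset, far from the full property.
    QUOTES = re.compile("[\u0022\u0027\u2018\u2019\u201C\u201D]")


def drop_quotes(text):
    """Delete every character matched by QUOTES."""
    return QUOTES.sub("", text)
```

With regex available, the set of matched characters follows whatever Unicode version the installed package ships with, so it can differ slightly from the list above.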

Solution 3:

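It turns out there's a much easier way to do this. Just prefix the regex string literal with 'u' to make it a Unicode string. In Python 2 the combined raw Unicode prefix is written ur (ru is a syntax error); in Python 3 every str is already Unicode, so a plain r prefix is enough.

regexp = ur'\"*\“*'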

Make sure you pass the re.UNICODE flag when you compile, search, or match the regex against your string.

re.findall(regexp, string, re.UNICODE)

Don't forget to include the

#!/usr/bin/python
# -*- coding: utf-8 -*-

at the start of the source file to make sure unicode strings can be written in your source file.
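For comparison, here is a small sketch of the same approach under Python 3, where source files are UTF-8 by default and no prefix, flag, or coding line is required. The pattern is an assumption that widens the one above to include the closing curly quote, and it uses + rather than * so it never matches the empty string; DOUBLE_QUOTES and remove_double_quotes are illustrative names.

```python
import re

# One or more straight or curly double quotes, written literally in the source.
DOUBLE_QUOTES = re.compile('["“”]+')


def remove_double_quotes(text):
    """Strip straight and curly double quotes from text."""
    return DOUBLE_QUOTES.sub("", text)
```

Writing the characters literally keeps the pattern readable, at the cost of relying on the file being saved as UTF-8.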
