Python Regex Can't Find Substring But It Should
I am trying to parse HTML using BeautifulSoup to extract the webpage title. Sometimes this does not work because the website is badly written, for example when it contains a bad end tag. When
Solution 1:
You should use the re.DOTALL flag to make the . match newline characters as well.
result = re.search(r'<title>(.+?)</title>', html, re.DOTALL)
As the documentation says:
...without this flag, '.' will match anything except a newline
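To see the difference the flag makes, here is a minimal sketch. The html string is a made-up example (not the asker's actual page) whose title spans several lines, which is the situation where the search without the flag comes back empty.

```python
import re

# Hypothetical page whose <title> content spans several lines.
html = "<html><head><title>\n  My Page\n</title></head></html>"

# Without re.DOTALL, '.' stops at the newline right after <title>,
# so the pattern cannot reach </title> and the search fails.
without_flag = re.search(r"<title>(.+?)</title>", html)
print(without_flag)  # None

# With re.DOTALL, '.' also matches newlines, so the lazy group
# captures everything up to the first </title>.
with_flag = re.search(r"<title>(.+?)</title>", html, re.DOTALL)
title = with_flag.group(1).strip()
print(title)  # My Page
```

The strip() call is only there to drop the surrounding whitespace and newlines from the captured text.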
Solution 2:
If you want to grab the text between the <title> and </title> tags you should use this regexp:
pattern = r"<title>([^<]+)</title>"
re.findall(pattern, html_string)
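This works without any flag because the negated character class [^<] matches every character except <, newlines included. A small sketch, using an invented html_string for illustration:

```python
import re

# Hypothetical input: the title text is split across lines.
html_string = "<head><title>\nExample\nDomain\n</title></head>"

# [^<]+ consumes any run of characters that are not '<',
# so it crosses newlines without needing re.DOTALL.
pattern = r"<title>([^<]+)</title>"
titles = [t.strip() for t in re.findall(pattern, html_string)]
print(titles)
```

Note the trade-offs of this pattern: it will not match if the title text itself contains a < character, if the tag is written in a different case (passing re.IGNORECASE would handle that), or if the opening tag carries attributes.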