Extracting Url From Style: Background-url: With Beautifulsoup And Without Regex?
I have:
Solution 1:
You could try using the cssutils package. Something like this should work:
import cssutils
from bs4 import BeautifulSoup
html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');" />"""
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid bs4's "no parser specified" warning
div_style = soup.find('div')['style']
style = cssutils.parseStyle(div_style)
url = style['background-image']
>>> url
u'url(/uploads/images/players/16113-1399107741.jpeg)'
>>> url = url.replace('url(', '').replace(')', '')  # or regex/split/find/slice etc.
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
You still have to strip the url(...) wrapper from the value yourself, but because the style attribute is parsed as CSS rather than matched as raw text, this method should be more resilient to changes in the HTML. If you really dislike string manipulation and regex, you can pull the URL out in this roundabout way:
sheet = cssutils.css.CSSStyleSheet()
sheet.add("dummy_selector { %s }" % div_style)
url = list(cssutils.getUrls(sheet))[0]
>>> url
u'/uploads/images/players/16113-1399107741.jpeg'
Solution 2:
How about using str.split:
>>> style
'<div ... url(\'/uploads/images/players/16113-1399107741.jpeg\');"'
>>> style.split("('", 1)[1].split("')")[0]
'/uploads/images/players/16113-1399107741.jpeg'
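The one-liner above assumes the URL is wrapped in single quotes. As a minimal sketch of the same no-regex idea, the helper below uses str.partition so that double-quoted and unquoted url(...) values also work. The name extract_css_url is hypothetical and not part of any library, and the sketch assumes the URL itself contains no ")" character.

```python
def extract_css_url(style):
    """Return the argument of the first url(...) in a style string, or None.

    Hypothetical helper for illustration; assumes the URL contains no ')'.
    """
    # Keep everything after the first "url(".
    _, found, rest = style.partition("url(")
    if not found:
        return None
    # Keep everything before the next ")".
    inner, closed, _ = rest.partition(")")
    if not closed:
        return None
    # Drop surrounding whitespace and either kind of quote.
    return inner.strip().strip("'\"")


style = "background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
print(extract_css_url(style))
```

Returning None when there is no url(...) lets the caller skip elements without a background image instead of hitting an IndexError from split.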
Solution 3:
from bs4 import BeautifulSoup
import re
html = """<div class="image" style="background-image: url('/uploads/images/players/16113-1399107741.jpeg');"></div>"""
soup = BeautifulSoup(html, 'html.parser')
image_div = soup.find('div')['style']
# The URL in this example is relative, so match the contents of url(...) rather than searching for "http".
ptr = re.search(r"url\(['\"]?(.*?)['\"]?\)", image_div)
print(ptr.group(1))  # /uploads/images/players/16113-1399107741.jpeg
Solution 4:
Without regex, you can just use str.find and a string slice:
>>> s
"background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
>>> s.find("('")
21
>>> s.find("')")
68
>>> s[21+len("('"):68]
'/uploads/images/players/16113-1399107741.jpeg'
However, I think a regex is the better choice in your case.
Solution 5:
In [1]: s = "background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
In [2]: start = s.find("url('")
In [3]: start
Out[3]: 18
In [4]: end = s.find("');")
In [5]: end
Out[5]: 68
In [6]: url = s[start+len("url('"):end]
In [7]: url
Out[7]: '/uploads/images/players/16113-1399107741.jpeg'
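One caveat with the str.find approach: find returns -1 when the marker is missing, and slicing with -1 silently produces a wrong string rather than an error. A possible variation, sketched here with the hypothetical helper name url_from_style, uses str.index, which raises ValueError instead. Like the session above, it assumes the single-quoted url('...') form.

```python
def url_from_style(s):
    """Slice out the text between url(' and '), raising if either is missing.

    Hypothetical helper for illustration; assumes single-quoted url('...').
    """
    prefix = "url('"
    # str.index raises ValueError where str.find would return -1.
    start = s.index(prefix) + len(prefix)
    # Look for the closing marker only after the prefix.
    end = s.index("')", start)
    return s[start:end]


s = "background-image: url('/uploads/images/players/16113-1399107741.jpeg');"
print(url_from_style(s))
```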