
Python: How To Access And Iterate Over A List Of Div Class Element Using (beautifulsoup)

I'm parsing data about car production with BeautifulSoup (see also my first question):

from bs4 import BeautifulSoup
import string

html = '''

Production Capacity (year)&

Solution 1:

This is my solution. You need to handle each element tag and parse it accordingly. I went a step beyond your problem and offered a more flexible way to access each data value. Hope it helps.

import re

from bs4 import BeautifulSoup

html_doc = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
    Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
    Vehicle 809,000 units ( 2016 ) 
    </div>
    <div class="profile-area">
    Vehicle 815,000 units ( 2015 ) 
    </div>
    <div class="profile-area">
    Vehicle 836,000 units ( 2014 ) 
    </div>
    <div class="profile-area">
    Vehicle 807,000 units ( 2013 ) 
    </div>
    <div class="profile-area">
    Vehicle 760,000 units ( 2012 ) 
    </div>
    <div class="profile-area">
    Vehicle 805,000 units ( 2011 ) 
    </div>"""

soup = BeautifulSoup(html_doc, 'html.parser')
h4_elements = soup.find_all('h4')
profile_areas = soup.find_all('div', attrs={'class': 'profile-area'})
print('\n')
print("++++++++++++++++++++++++++++++++++++")
print("Element counts")
print("++++++++++++++++++++++++++++++++++++")
print("Total H4: {}".format(len(h4_elements)))
print("++++++++++++++++++++++++++++++++++++")
print("Total profile-area: {}".format(len(profile_areas)))
print("++++++++++++++++++++++++++++++++++++")
print('\n')

for heading in h4_elements:
    print("++++++++++++++++++++++++++++++++++++")
    print(heading.text.strip())
    print("++++++++++++++++++++++++++++++++++++")
    # Visit only the divs that follow this heading and stop at the next <h4>,
    # so each value is reported under the heading it belongs to.
    for sibling in heading.find_next_siblings(['h4', 'div']):
        if sibling.name == 'h4':
            break
        # Drop thousands separators, then collapse punctuation and whitespace.
        raw = re.sub('[^A-Za-z0-9]+', ' ', sibling.text.replace(',', '')).strip()
        el = raw.split(' ')

        print('Type: {}'.format(el[0]))
        print('Sold: {} {}'.format(el[1], el[2]))
        # The capacity entry ends in "/year" instead of a year in parentheses.
        print('Year: {}'.format(el[3] if el[3].isdigit() else 'n/a'))
        print("++++++++++++++++++++++++++++++++++++")

After the element counts, the loop prints:

++++++++++++++++++++++++++++++++++++
Production Capacity (year)
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 1140000 units
Year: n/a
++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++
Output
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 809000 units
Year: 2016
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 815000 units
Year: 2015
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 836000 units
Year: 2014
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 807000 units
Year: 2013
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 760000 units
Year: 2012
++++++++++++++++++++++++++++++++++++
Type: Vehicle
Sold: 805000 units
Year: 2011
++++++++++++++++++++++++++++++++++++
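
If the values are needed later rather than just printed, one option is to collect them as records. The following is only a minimal sketch under a few assumptions: `parse_entries` and `ENTRY_RE` are hypothetical names introduced here, the markup is a shortened copy of the sample above, and every entry is assumed to look like "Vehicle N units" with an optional year in parentheses.

```python
import re

from bs4 import BeautifulSoup

# Shortened copy of the sample markup used in this answer.
html_doc = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
    Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
    Vehicle 809,000 units ( 2016 )
    </div>
    <div class="profile-area">
    Vehicle 815,000 units ( 2015 )
    </div>
"""

# Assumed entry shape: type, comma-grouped count, "units", optional "( year )".
ENTRY_RE = re.compile(r'(\w+)\s+([\d,]+)\s+units(?:\s*\(\s*(\d{4})\s*\))?')


def parse_entries(markup):
    """Return one dict per profile-area div, tagged with its nearest preceding <h4>."""
    soup = BeautifulSoup(markup, 'html.parser')
    records = []
    for div in soup.find_all('div', class_='profile-area'):
        heading = div.find_previous('h4')
        match = ENTRY_RE.search(div.get_text(strip=True))
        if match is None:
            continue  # skip divs that do not fit the assumed shape
        records.append({
            'section': heading.get_text(strip=True) if heading else None,
            'type': match.group(1),
            'units': int(match.group(2).replace(',', '')),
            'year': int(match.group(3)) if match.group(3) else None,
        })
    return records


records = parse_entries(html_doc)
for record in records:
    print(record)
```

Each record keeps the heading it belongs to, so filtering by section or year becomes an ordinary list comprehension, for example `[r for r in records if r['year'] == 2016]`.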

Solution 2:

I would suggest storing each entry in a dictionary; you can then easily extract the fields you want at the end (you don't seem to want 2011?):

from bs4 import BeautifulSoup
import re

html = """
<h4>Production Capacity (year)</h4>
    <div class="profile-area">
      Vehicle 1,140,000 units /year
    </div>
<h4>Output</h4>
    <div class="profile-area">
      Vehicle 809,000 units ( 2016 ) 
    </div>
    <div class="profile-area">
      Vehicle 815,000 units ( 2015 ) 
    </div>
    <div class="profile-area">
      Vehicle 836,000 units ( 2014 ) 
    </div>
    <div class="profile-area">
      Vehicle 807,000 units ( 2013 ) 
    </div>
    <div class="profile-area">
      Vehicle 760,000 units ( 2012 ) 
    </div>
    <div class="profile-area">
      Vehicle 805,000 units ( 2011 ) 
    </div>
"""

soup = BeautifulSoup(html, 'lxml')
units = {}

for item in soup.find_all(['h4', 'div']):
    if item.name == 'h4':
        for h4 in ['capacity', 'output', 'models']:
            if h4 in item.text.lower():
                    break
    elif item.get('class', [''])[0] == 'profile-area':
        vehicle = item.get_text(strip=True)

        if h4 == 'output':
            re_year = re.search(r'\( (\d+) \)', vehicle)

            if re_year:
                year = re_year.group(1)
            else:
                year = 'unknown'

            units[year] = vehicle
        else:
            units[h4] = vehicle

req_fields = ['models', 'capacity', '2012', '2013', '2014', '2015', '2016']            
print(';'.join([units.get(field, '') for field in req_fields]))

This would display:

;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )

A regular expression is used to extract the year from the vehicle entry. This is then used as the key in the dictionary.
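
The year-as-key idea can be tried on its own, without any HTML. This is a small sketch, assuming the entry strings have already been extracted; `year_key` and `YEAR_RE` are hypothetical names, and the pattern is a slightly looser variant of the one in the answer (it tolerates missing spaces inside the parentheses and expects a four-digit year).

```python
import re

# Entry strings as they come out of get_text(strip=True) in the answer above.
entries = [
    'Vehicle 1,140,000 units /year',
    'Vehicle 809,000 units ( 2016 )',
    'Vehicle 815,000 units ( 2015 )',
]

# Assumed variant of the answer's pattern: optional spaces, four-digit year.
YEAR_RE = re.compile(r'\(\s*(\d{4})\s*\)')


def year_key(entry, default='unknown'):
    """Return the year found in parentheses, or the default if there is none."""
    match = YEAR_RE.search(entry)
    return match.group(1) if match else default


# Entries without a year all share the 'unknown' key, so later ones overwrite earlier ones.
units = {year_key(entry): entry for entry in entries}
print(units)
```

Because several entries without a year would collide on the same key, the answer's approach of storing those under the heading name instead (`units[h4] = vehicle`) is the safer choice for real pages.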

For the HTML in pastebin it gives:

Volkswagen Golf, Golf Variant(Estate), Golf Plus, CrossGolf (2006-), e-Golf (2014-)Volkswagen Touran, CrossTouran (2007-), Tiguan (2007-);I.D. electric vehicles based on MEB (planning);SEAT new SUV MQB-A2 platform (2018- planning);Components:press shop, chassis, plastics technology;Vehicle 1,140,000 units /year;Vehicle 760,000 units ( 2012 );Vehicle 807,000 units ( 2013 );Vehicle 836,000 units ( 2014 );Vehicle 815,000 units ( 2015 );Vehicle 809,000 units ( 2016 )
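
Note that a plain `';'.join(...)` becomes ambiguous as soon as a field itself contains a semicolon. One possible way around that is the standard `csv` module, which quotes such fields. The sketch below is only an illustration: `to_row` is a hypothetical helper, and the `models` value is made up to show the quoting.

```python
import csv
import io

units = {
    'models': 'Golf; Touran',  # made-up value that contains the delimiter
    'capacity': 'Vehicle 1,140,000 units /year',
    '2016': 'Vehicle 809,000 units ( 2016 )',
}
req_fields = ['models', 'capacity', '2015', '2016']


def to_row(units, fields, delimiter=';'):
    """Serialise the requested fields as one delimited line, quoting where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator='\n')
    # Missing fields become empty strings, as with units.get(field, '') above.
    writer.writerow([units.get(field, '') for field in fields])
    return buffer.getvalue()


row = to_row(units, req_fields)
print(row, end='')

# Reading the line back recovers the original four fields.
parsed = next(csv.reader(io.StringIO(row), delimiter=';'))
print(parsed)
```

Only the field containing a semicolon is wrapped in double quotes; the commas inside the unit counts are left alone because the delimiter here is `;`.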
