Scraping XML Element Attributes With BeautifulSoup
I have the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://api.stlouisfed.org/fred/...')
bsObj = BeautifulSoup(html.read(),
Solution 1:
Find all the observation tags and read the attributes you want from each one:
x = """<?xml version="1.0" encoding="utf-8" ?><html><body>
<observations count="276" file_type="xml" limit="100000" observation_end="9999-12-31" observation_start="1776-07-04" offset="0" order_by="observation_date" output_type="1" realtime_end="2016-06-22" realtime_start="2016-06-22" sort_order="asc" units="lin">
<observation date="1947-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-0.4"></observation>
<observation date="1947-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.4"></observation>
<observation date="1948-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6"></observation>
<observation date="1948-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="6.7"></observation>
<observation date="1948-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="2.3"></observation>
<observation date="1948-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="0.4"></observation>
<observation date="1949-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-5.4"></observation>
<observation date="1949-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-1.3"></observation>
<observation date="1949-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="4.5"></observation>
<observation date="1949-10-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="-3.5"></observation>
<observation date="1950-01-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.9"></observation>
<observation date="1950-04-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="12.7"></observation>
<observation date="1950-07-01" realtime_end="2016-06-22" realtime_start="2016-06-22" value="16.3"></observation>
</observations></body></html>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(x, "lxml")
# Each <observation> tag exposes its attributes like a dict.
for ob in soup.find_all("observation"):
    print(ob["date"])
    print(ob["value"])
Which will give you:
1947-04-01
-0.4
1947-07-01
-0.4
1947-10-01
6.4
1948-01-01
6
1948-04-01
6.7
1948-07-01
2.3
1948-10-01
0.4
1949-01-01
-5.4
1949-04-01
-1.3
1949-07-01
4.5
1949-10-01
-3.5
1950-01-01
16.9
1950-04-01
12.7
1950-07-01
16.3
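For comparison, the same attribute lookup can be sketched with only the standard library's xml.etree.ElementTree instead of BeautifulSoup. This is a swapped-in technique, not the method used in this answer. The two-row sample string is a shortened, hypothetical stand-in for the response, without the <html><body> wrapper, and the conversion to float is an added assumption that every value is numeric.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row sample in the same shape as the response above,
# without the <html><body> wrapper.
sample = (
    '<observations count="2" units="lin">'
    '<observation date="1947-04-01" value="-0.4"></observation>'
    '<observation date="1947-07-01" value="-0.4"></observation>'
    '</observations>'
)

root = ET.fromstring(sample)

# iter() walks the whole tree, so it finds <observation> at any depth.
# float() raises ValueError if a value is not numeric.
rows = [(ob.get("date"), float(ob.get("value")))
        for ob in root.iter("observation")]

for date, value in rows:
    print(date, value)
```

Unlike ob["date"], which raises KeyError when the attribute is absent, ob.get("date") returns None, so it may be the safer choice if some tags lack an attribute.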
To write to a csv:
from bs4 import BeautifulSoup
import csv

soup = BeautifulSoup(x, "lxml")
# newline="" stops the csv module from writing blank rows on Windows.
with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows((ob["date"], ob["value"])
                            for ob in soup.find_all("observation"))
Which gives you a csv file with:
1947-04-01,-0.4
1947-07-01,-0.4
1947-10-01,6.4
1948-01-01,6
1948-04-01,6.7
1948-07-01,2.3
1948-10-01,0.4
1949-01-01,-5.4
1949-04-01,-1.3
1949-07-01,4.5
1949-10-01,-3.5
1950-01-01,16.9
1950-04-01,12.7
1950-07-01,16.3
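If the file should carry column names, one option is to write a header row before the data. The sketch below is a minimal illustration and not part of the original answer: the first three rows are hard-coded from the output above so the block runs on its own, the header names date and value are an assumption, and io.StringIO stands in for the real file.

```python
import csv
import io

# Rows copied from the output above; in practice they would come from
# soup.find_all("observation").
rows = [("1947-04-01", "-0.4"), ("1947-07-01", "-0.4"), ("1947-10-01", "6.4")]

buf = io.StringIO()  # stands in for open("out.csv", "w", newline="")
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["date", "value"])  # header row (names are an assumption)
writer.writerows(rows)

csv_text = buf.getvalue()
print(csv_text)
```

With a real file, passing newline="" to open() and leaving lineterminator at its default is the usual arrangement.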
Solution 2:
The above works great! If you want to read from a URL instead of a hard-coded string, the code looks like this:
from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://api.stlouisfed.org/fred/series/.......")
bsObj = BeautifulSoup(html.read(), "lxml")
# Print the date and value attributes of every <observation> tag.
for ob in bsObj.find_all("observation"):
    print(ob["date"])
    print(ob["value"])
and for the .csv:
import csv

# newline="" stops the csv module from writing blank rows on Windows.
with open("out.csv", "w", newline="") as f:
    csv.writer(f).writerows((ob["date"], ob["value"])
                            for ob in bsObj.find_all("observation"))
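One possible refinement is to keep the download and the parsing in separate functions, so the response is closed promptly and the parsing can be checked without network access. The sketch below is only an illustration: parse_observations, fetch_observations and the opener parameter are hypothetical names, the example.invalid URL is a placeholder that is never requested, and parsing uses the standard library's ElementTree purely so the block has no third-party dependency. The BeautifulSoup loop above could be used inside parse_observations instead.

```python
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_observations(xml_bytes):
    """Return (date, value) string pairs from raw XML bytes."""
    root = ET.fromstring(xml_bytes)
    return [(ob.get("date"), ob.get("value"))
            for ob in root.iter("observation")]


def fetch_observations(url, opener=urlopen):
    """Download url and parse it; opener is injectable for testing."""
    with opener(url) as response:  # closes the connection when done
        return parse_observations(response.read())


def fake_opener(url):
    # Offline stand-in for urlopen: returns canned bytes, ignores the URL.
    return io.BytesIO(
        b'<?xml version="1.0" encoding="utf-8" ?>'
        b'<observations>'
        b'<observation date="1947-04-01" value="-0.4"/>'
        b'</observations>'
    )


result = fetch_observations("https://example.invalid/observations",
                            opener=fake_opener)
print(result)
```

Calling fetch_observations(url) without the opener argument would use urlopen and make a real request.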