Fetching Content From HTML And Writing Fetched Content In A Specific Format In CSV
I have HTML Code like:
Organi
Solution 1:
I modified the original solution a bit. It is best to loop only once and build one big DataFrame with all the data, then select a subset of columns with df[['col1','col2']] for each new DataFrame.
To remove the numbers in parentheses, you can use str.replace:
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

prefix = "https://data.gov.au"
dfs = []  # one DataFrame per page, concatenated after the loop

# webpage_urls is assumed to be defined already (the list of page URLs to scrape)
for url in webpage_urls:
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "lxml")
    lobbying = {}
    # always only 2 active li, so select first by [0] and second by [1]
    active = soup.find_all('li', class_="nav-item active")
    org = active[0].span.get_text()
    groups = active[1].span.get_text()
    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        # one entry per dataset title on the page
        lobbying[element.a.get_text()] = {
            "link": prefix + element.a["href"],
            "Organisation": org,
            "Group": groups,
        }
    # print(lobbying)
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset='Titles').reset_index(drop=True)
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)
print(df1.head())
                                              Titles            Organisation  \
0                                     Banks – Assets  Reserve Bank of Aus...
1  Consolidated Exposures – Immediate and Ultimat...  Reserve Bank of Aus...
2  Foreign Exchange Transactions and Holdings of ...  Reserve Bank of Aus...
3  Finance Companies and General Financiers – Sel...  Reserve Bank of Aus...
4                   Liabilities and Assets – Monthly  Reserve Bank of Aus...

                                                link                   Group
0           https://data.gov.au/dataset/banks-assets  Business Support an...
1  https://data.gov.au/dataset/consolidated-expos...  Business Support an...
2  https://data.gov.au/dataset/foreign-exchange-t...  Business Support an...
3  https://data.gov.au/dataset/finance-companies-...  Business Support an...
4  https://data.gov.au/dataset/liabilities-and-as...  Business Support an...
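The str.replace step can be checked offline with a minimal sketch. The two labels below are made-up examples shaped like the scraped text, not real page content. The pattern here also swallows the whitespace before the parenthesis, which is a small addition to the pattern used above.

```python
import pandas as pd

# Hypothetical labels with a trailing count in parentheses (illustration only).
s = pd.Series(["Reserve Bank of Australia (25)", "Business Support and Regulation (3)"])

# Remove "(digits)" and any whitespace in front of it.
# regex=True makes pandas treat the pattern as a regular expression.
cleaned = s.str.replace(r"\s*\(\d+\)", "", regex=True).str.strip()

print(cleaned.tolist())
```

Without regex=True, recent pandas versions would look for the literal text of the pattern and leave the strings unchanged.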
df2 = df1[['Titles', 'link']]
print (df2.head())
                                              Titles  \
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly

                                                link
0           https://data.gov.au/dataset/banks-assets
1  https://data.gov.au/dataset/consolidated-expos...
2  https://data.gov.au/dataset/foreign-exchange-t...
3  https://data.gov.au/dataset/finance-companies-...
4  https://data.gov.au/dataset/liabilities-and-as...
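The loop-once idea can also be tried without any network access. In the sketch below, the per-page dictionaries are hypothetical stand-ins for what each scraped page would produce (the organisation, group and example.com link are made up); only the pandas calls mirror the solution.

```python
import pandas as pd

# Hypothetical per-page results; the first title appears on both pages.
pages = [
    {"Banks – Assets": {"link": "https://data.gov.au/dataset/banks-assets",
                        "Organisation": "Org A", "Group": "Group X"}},
    {"Banks – Assets": {"link": "https://data.gov.au/dataset/banks-assets",
                        "Organisation": "Org A", "Group": "Group X"},
     "Another Title": {"link": "https://example.com/dataset/another-title",
                       "Organisation": "Org A", "Group": "Group X"}},
]

dfs = []
for lobbying in pages:
    # Dictionary keys become the index, which is then turned into a 'Titles' column.
    df = (pd.DataFrame.from_dict(lobbying, orient="index")
            .rename_axis("Titles")
            .reset_index())
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)                       # one big DataFrame
unique = combined.drop_duplicates(subset="Titles").reset_index(drop=True)
titles_links = unique[["Titles", "link"]]                          # column subset

print(titles_links)
```

Collecting the frames in a list and calling pd.concat once avoids repeatedly copying a growing DataFrame inside the loop.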
df3 = df1[['Group','Organisation','Titles']]
print (df3.head())
                    Group            Organisation  \
0  Business Support an...  Reserve Bank of Aus...
1  Business Support an...  Reserve Bank of Aus...
2  Business Support an...  Reserve Bank of Aus...
3  Business Support an...  Reserve Bank of Aus...
4  Business Support an...  Reserve Bank of Aus...

                                              Titles
0                                     Banks – Assets
1  Consolidated Exposures – Immediate and Ultimat...
2  Foreign Exchange Transactions and Holdings of ...
3  Finance Companies and General Financiers – Sel...
4                   Liabilities and Assets – Monthly
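The answer stops at the DataFrames, so here is one possible way to write them out as CSV with DataFrame.to_csv. This is a sketch under assumptions: the small df1 below is made-up sample data, and the file names titles_links.csv and groups.csv are hypothetical. A temporary directory is used only so the example does not leave files behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned df1 from the solution.
df1 = pd.DataFrame({
    "Titles": ["Banks – Assets", "Another Title"],
    "link": ["https://data.gov.au/dataset/banks-assets",
             "https://example.com/dataset/another-title"],
    "Organisation": ["Org A", "Org A"],
    "Group": ["Group X", "Group X"],
})

df2 = df1[["Titles", "link"]]
df3 = df1[["Group", "Organisation", "Titles"]]

out_dir = Path(tempfile.mkdtemp())          # hypothetical output location
links_path = out_dir / "titles_links.csv"   # hypothetical file name
groups_path = out_dir / "groups.csv"        # hypothetical file name

# index=False drops the 0, 1, 2, ... row labels from the files.
df2.to_csv(links_path, index=False, encoding="utf-8")
df3.to_csv(groups_path, index=False, encoding="utf-8")

# Read one file back to confirm the column order was preserved.
round_trip = pd.read_csv(groups_path)
print(round_trip)
```

The column order in each file follows the order of the list used to select the columns, so the layout of the CSV can be changed just by reordering that list.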