Fetching Content From Html And Write Fetched Content In A Specific Format In Csv

I have HTML Code like:

Organi

Solution 1:

I modified the original solution a bit. It is best to loop only once and build one big DataFrame with all the data, then select subsets of columns, such as [['col1','col2']], for the new DataFrames.

To remove the numbers in parentheses, you can use str.replace:

import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

dfs = []
# webpage_urls: the list of page URLs to scrape, defined elsewhere
for i in webpage_urls:
    wiki2 = i
    page = urllib.request.urlopen(wiki2)
    soup = BeautifulSoup(page, "lxml")

    lobbying = {}
    # always only 2 active li, so select first by [0] and second by [1]
    org = soup.find_all('li', class_="nav-item active")[0].span.get_text()
    groups = soup.find_all('li', class_="nav-item active")[1].span.get_text()

    data2 = soup.find_all('h3', class_="dataset-heading")
    for element in data2:
        lobbying[element.a.get_text()] = {}
    prefix = "https://data.gov.au"
    for element in data2:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]
        lobbying[element.a.get_text()]["Organisation"] = org
        lobbying[element.a.get_text()]["Group"] = groups
    # print(lobbying)
    # one DataFrame per page, with the dict keys as the Titles column
    df = pd.DataFrame.from_dict(lobbying, orient='index') \
           .rename_axis('Titles').reset_index()
    dfs.append(df)

# combine the per-page DataFrames into one
df = pd.concat(dfs, ignore_index=True)
df1 = df.drop_duplicates(subset = 'Titles').reset_index(drop=True)
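
The combine-then-deduplicate step can be tried without any network access. The following is only a minimal sketch: the two dicts are hypothetical stand-ins for the `lobbying` dict built from each page, and the titles, organisation, group and example.org links are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the per-page `lobbying` dicts (not real scraped data).
page_one = {
    "Dataset A": {"link": "https://example.org/dataset/a",
                  "Organisation": "Example Org", "Group": "Example Group"},
}
page_two = {
    # "Dataset A" appears again, as it would if two listing pages overlap.
    "Dataset A": {"link": "https://example.org/dataset/a",
                  "Organisation": "Example Org", "Group": "Example Group"},
    "Dataset B": {"link": "https://example.org/dataset/b",
                  "Organisation": "Example Org", "Group": "Example Group"},
}

dfs = []
for lobbying in (page_one, page_two):
    # The dict keys become the index, which is then turned into a Titles column.
    frame = (pd.DataFrame.from_dict(lobbying, orient="index")
               .rename_axis("Titles")
               .reset_index())
    dfs.append(frame)

df = pd.concat(dfs, ignore_index=True)          # 3 rows, one of them a duplicate
df1 = df.drop_duplicates(subset="Titles").reset_index(drop=True)
print(df1)
```

Here `df` holds three rows and `df1` keeps only the first row for each title.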



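# regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)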
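
To see what the pattern does in isolation, here is a small sketch on made-up values (the names and counters are hypothetical). It assumes the counter is always digits inside parentheses. The pattern below also strips the whitespace in front of the parenthesis, which is an addition and not part of the answer above.

```python
import pandas as pd

# Hypothetical labels with a trailing counter in parentheses.
names = pd.Series(["Example Org (12)", "Other Org (3)", "No counter"])

# \s* matches optional whitespace, \(\d+\) matches digits wrapped in parentheses.
cleaned = names.str.replace(r"\s*\(\d+\)", "", regex=True)
print(cleaned.tolist())
# ['Example Org', 'Other Org', 'No counter']
```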

print (df1.head())
                                              Titles             Organisation  \
0                                     Banks – Assets  Reserve Bank of Aus...    
1  Consolidated Exposures – Immediate and Ultimat...  Reserve Bank of Aus...    
2  Foreign Exchange Transactions and Holdings of ...  Reserve Bank of Aus...    
3  Finance Companies and General Financiers – Sel...  Reserve Bank of Aus...    
4                   Liabilities and Assets – Monthly  Reserve Bank of Aus...    

                                                link                    Group  
0           https://data.gov.au/dataset/banks-assets  Business Support an...   
1  https://data.gov.au/dataset/consolidated-expos...  Business Support an...   
2  https://data.gov.au/dataset/foreign-exchange-t...  Business Support an...   
3  https://data.gov.au/dataset/finance-companies-...  Business Support an...   
4  https://data.gov.au/dataset/liabilities-and-as...  Business Support an...   

df2 = df1[['Titles', 'link']]
print (df2.head())
                                              Titles  \
0                                     Banks – Assets   
1  Consolidated Exposures – Immediate and Ultimat...   
2  Foreign Exchange Transactions and Holdings of ...   
3  Finance Companies and General Financiers – Sel...   
4                   Liabilities and Assets – Monthly   

                                                link  
0           https://data.gov.au/dataset/banks-assets  
1  https://data.gov.au/dataset/consolidated-expos...  
2  https://data.gov.au/dataset/foreign-exchange-t...  
3  https://data.gov.au/dataset/finance-companies-...  
4  https://data.gov.au/dataset/liabilities-and-as...  

df3 = df1[['Group','Organisation','Titles']]
print (df3.head())
                     Group             Organisation  \
0  Business Support an...   Reserve Bank of Aus...    
1  Business Support an...   Reserve Bank of Aus...    
2  Business Support an...   Reserve Bank of Aus...    
3  Business Support an...   Reserve Bank of Aus...    
4  Business Support an...   Reserve Bank of Aus...    

                                              Titles  
0                                     Banks – Assets  
1  Consolidated Exposures – Immediate and Ultimat...  
2  Foreign Exchange Transactions and Holdings of ...  
3  Finance Companies and General Financiers – Sel...  
4                   Liabilities and Assets – Monthly  
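
The answer stops at the DataFrames, but the question title asks for CSV output. One possible way to write the two subsets is `DataFrame.to_csv`. This is only a sketch: the sample rows, the file names and the temporary output directory are assumptions for illustration, not part of the original answer.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned df1 from above.
df1 = pd.DataFrame({
    "Titles": ["Dataset A", "Dataset B"],
    "Organisation": ["Example Org", "Example Org"],
    "link": ["https://example.org/dataset/a", "https://example.org/dataset/b"],
    "Group": ["Example Group", "Example Group"],
})
df2 = df1[["Titles", "link"]]
df3 = df1[["Group", "Organisation", "Titles"]]

# Assumed output location and file names; adjust to your own paths.
out_dir = Path(tempfile.mkdtemp())
links_csv = out_dir / "titles_links.csv"
groups_csv = out_dir / "group_organisation_titles.csv"

# index=False leaves the 0, 1, 2, ... row labels out of the files.
df2.to_csv(links_csv, index=False, encoding="utf-8")
df3.to_csv(groups_csv, index=False, encoding="utf-8")

print(links_csv.read_text(encoding="utf-8"))
```

Passing `index=False` keeps the column order chosen in the `[[...]]` selection as the only content of each file.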
