
Loop Url From Dataframe And Download Pdf Files In Python

Based on the code from here, I'm able to crawl the URL for each transaction and save them to an Excel file, which can be downloaded here. Now I would like to go further and click the

Solution 1:

Here is an example that downloads one of the PDF files listed in your uploaded Excel file.

from bs4 import BeautifulSoup
import requests

# This example handles a single page. To download many files, put the URLs in a list and loop over them.

url = 'http://xinsanban.eastmoney.com/Article/NoticeContent?id=AN201909041348533085'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

link = soup.select_one(".lookmore")
title = soup.select_one(".newsContent").select_one("h1").text

# File names must not contain ':', so replace it with '-'.
filename = title.strip().replace(":", "-") + '.pdf'
print(filename)

# The href of the ".lookmore" element points at the PDF itself.
data = requests.get(link.get("href")).content
with open(filename, "wb") as f:
    f.write(data)

The file downloads successfully.
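
The snippet above replaces only ':' in the title. Titles scraped from a page may contain other characters that Windows rejects in file names, so a slightly more defensive version could look like the sketch below. The helper name safe_pdf_name and the "download" fallback are my own assumptions, not part of the original answer.

```python
import re

# Characters that Windows does not allow in file names.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_pdf_name(title: str, replacement: str = "-") -> str:
    """Build a .pdf file name from a scraped title (hypothetical helper)."""
    cleaned = _INVALID_CHARS.sub(replacement, title.strip())
    # Fall back to a generic name if the title was empty or only whitespace.
    return (cleaned or "download") + ".pdf"
```

With this in place, the open() call could use safe_pdf_name(title) instead of the manual strip() and replace() chain.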


Solution 2:

Here's a slightly different approach. You don't have to open the URLs from the Excel file, because you can build the PDF source URLs yourself from the file ID in each notice URL.

For example:

import requests

urls = [
    "http://data.eastmoney.com/notices/detail/871792/AN201909041348533085,JWU2JWEwJTk2JWU5JTljJTllJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/872955/AN201912101371726768,JWU0JWI4JWFkJWU5JTgzJWJkJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/832816/AN202008171399155565,JWU3JWI0JWEyJWU1JTg1JThiJWU3JTg5JWE5JWU0JWI4JTlh.html",
    "http://data.eastmoney.com/notices/detail/831971/AN201505220009713696,JWU1JWJjJTgwJWU1JTg1JTgzJWU3JTg5JWE5JWU0JWI4JTlh.html",
]

for url in urls:
    file_id, _ = url.split('/')[-1].split(',')
    pdf_file_url = f"http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"
    print(f"Fetching {pdf_file_url}...")
    with open(f"{file_id}.pdf", "wb") as f:
        f.write(requests.get(pdf_file_url).content)
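
The loop above assumes the last path segment of every notice URL has the form "<file_id>,<something>.html"; the tuple unpacking raises an unhelpful ValueError if a URL has no comma or more than one. A minimal sketch that isolates this step and reports the offending URL is shown below. The function names are hypothetical, and the PDF URL pattern is simply the one used in the loop above, not a documented API.

```python
from urllib.parse import urlsplit

# Pattern copied from the loop above; the site may change it at any time.
PDF_URL_TEMPLATE = "http://pdf.dfcfw.com/pdf/H2_{file_id}_1.pdf"


def notice_file_id(notice_url: str) -> str:
    """Extract the file ID (the part before the comma) from a notice URL."""
    last_segment = urlsplit(notice_url).path.rsplit("/", 1)[-1]
    file_id, sep, _ = last_segment.partition(",")
    if not sep or not file_id:
        raise ValueError(f"unexpected notice URL format: {notice_url!r}")
    return file_id


def pdf_url_for(notice_url: str) -> str:
    """Build the PDF URL for a notice URL (hypothetical helper)."""
    return PDF_URL_TEMPLATE.format(file_id=notice_file_id(notice_url))
```

Each result can then be passed to requests.get() as in the loop above. If the URLs sit in a DataFrame column (say, a hypothetical column named url), something like df["url"].dropna().map(pdf_url_for) should yield the PDF URLs in one step.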
