Scraping Paginated Web Table With Python Pandas & Beautifulsoup
I am a beginner in Python pandas. I am trying to scrape a paginated table using the Beautiful Soup package. The data is scraped, but the content of each cell comes in a single row, i co
Solution 1:
I did some cleaning. First, why the bytes type? You're writing text. And why ASCII? Please use Unicode; if you really need ASCII later in your code, encode to ASCII at that point. The use of findAll is deprecated; please use find_all. You also had a possible issue with commas in the surface value. Finally, always use context managers when possible (here: when working with files).
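As an aside on the comma issue: instead of deleting commas from the values, the standard library's csv module can quote any field that contains one. This is only a minimal sketch, assuming the rows have already been collected as lists of strings; the write_rows helper and the sample rows are hypothetical and not part of the original code.

```python
import csv
import io

def write_rows(out, header, rows):
    """Write a header and data rows to a text file object as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

header = ["Nom_commercial_du_Site", "Surface_GLA"]
# Hypothetical sample rows; the second surface value contains a comma.
rows = [["Ancona", "7644 m²"], ["Exemple", "27,000 m²"]]

# io.StringIO stands in for a real file here. With a file you would use
# `with open(path, "w", newline="", encoding="utf-8") as f:` instead.
buffer = io.StringIO()
write_rows(buffer, header, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Reading the text back with csv.reader returns the comma-containing value intact, so nothing is lost from the scraped data.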
And now for your question. You had two problems:
- Your test if len(saverec)!=0: was inside the inner for-loop, generating lots of useless data.
- You were not stripping the data of its unneeded whitespace.
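The effect of both fixes can be seen without touching the network. The sketch below assumes each table row has already been reduced to a list of raw cell strings (standing in for the text of the td elements); the build_lines name and the sample data are hypothetical.

```python
def build_lines(rows):
    """Turn rows of raw cell strings into comma-separated lines."""
    lines = []
    for cells in rows:
        saverec = ""
        for cell in cells:
            # Strip surrounding whitespace and newlines from each cell.
            saverec = saverec + "," + cell.strip()
        # Test once per row, after the inner loop has finished, so that
        # rows without any cells (such as a header row) are skipped.
        if len(saverec) != 0:
            lines.append(saverec[1:])
    return lines

sample_rows = [
    [],  # a row with no td cells, such as the header row
    ["\n  Ancona \n", " Ancona", "Ouvert  "],
    ["Arras", "\nARRAS\n", "Ouvert"],
]
lines = build_lines(sample_rows)
print("\n".join(lines))
```

With the test placed inside the inner loop instead, a partial line would be appended after every cell, which is where the useless data came from.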
import os
import urllib.request

from bs4 import BeautifulSoup


def make_soup(url):
    # Download the page and parse it with the built-in HTML parser.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


save = ""
for num in range(0, 22):
    soup = make_soup("http://www.ceetrus.com/fr/implantations-sites-commerciaux?page=" + str(num))
    for rec in soup.find_all('tr'):
        saverec = ""
        for data in rec.find_all('td'):
            # Strip surrounding whitespace and drop commas inside a value.
            data = data.text.strip()
            if "," in data:
                data = data.replace(",", "")
            saverec = saverec + "," + data
        # Test once per row, outside the inner loop; rows without td cells are skipped.
        if len(saverec) != 0:
            save = save + "\n" + saverec[1:]
    print('#%d done' % num)

headers = "Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact"
# Write as UTF-8 text so the accented and Cyrillic names survive on any platform.
with open(os.path.expanduser("sites_commerciaux.csv"), "w", encoding="utf-8") as csv_file:
    csv_file.write(headers)
    csv_file.write(save)
which outputs for the first page:
Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact
ALCORCÓN,ALCORCÓN - MADRID,Ouvert,4298 m²,40,José Carlos GARCIA
Alegro Alfragide,CARNAXIDE,Ouvert,11461 m²,122,
Alegro Castelo Branco,CASTELO BRANCO,Ouvert,6830 m²,55,
Alegro Setúbal,Setúbal,Ouvert,27000 m²,114,
Ancona,Ancona,Ouvert,7644 m²,41,Ettore PAPPONETTI
Angoulême La Couronne,LA COURONNE,Ouvert,6141 m²,45,Juliette GALLOUEDEC
Annecy Grand Epagny,EPAGNY,Ouvert,20808 m²,61,Delphine BENISTY
Anping,Tainan,Ouvert,969 m²,21,Roman LEE
АКВАРЕЛЬ,Volgograd,Ouvert,94025 m²,182,Viktoria ZAITSEVA
Arras,ARRAS,Ouvert,4000 m²,26,Anais NIZON
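Since the question mentions pandas, output in this shape can be loaded into a DataFrame with pandas.read_csv. The following is only a sketch: it assumes pandas is installed and uses two lines copied from the output above in place of the saved file.

```python
import io

import pandas as pd

# Two lines copied from the output above, standing in for the saved file.
sample = (
    "Nom_commercial_du_Site,Ville,Etat,Surface_GLA,Nombre_de_boutique,Contact\n"
    "Ancona,Ancona,Ouvert,7644 m²,41,Ettore PAPPONETTI\n"
    "Alegro Setúbal,Setúbal,Ouvert,27000 m²,114,\n"
)

# For the real file, pass its path and encoding instead of a StringIO.
df = pd.read_csv(io.StringIO(sample))
print(df.shape)
print(df["Nombre_de_boutique"].sum())
```

Each scraped row becomes one DataFrame row with six columns, and an empty Contact cell is read as a missing value.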