Showing Two Different Errors With The Same Code That I Used To Scrape Other Pages

July 02, 2024

I used some code to scrape two pages from TripAdvisor, and it worked very well. But now it shows me two different errors: with open('iletaitunsquare1.csv', 'w', encoding='utf-8-s

Solution 1:

This is because the original code (not posted here) relied on the truthy/falsy value of offset 0, which was the first offset in your prior question.

For example, with:

for offset in range(0, 10, 10):
    if not offset:
        ...  # runs only when offset is 0

The first value, 0, is falsy, whereas numbers greater than 0 (as in this scenario) are truthy. So "not offset" is True only when offset is 0, and that is when inf_rest_name gets set. This ensures its value is only set on the first loop iteration rather than every time. The value doesn't change, so there is no need to read it again.

With the following, all values are truthy, so inf_rest_name never gets set.

for offset in range(40, 290, 10):
    if not offset:
        ...  # never reached: no offset in this range is 0
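To see the difference between the two ranges without touching the scraper, here is a small sketch in plain Python. The helper name falsy_offsets is made up for illustration; it simply collects the offsets for which the "not offset" test passes.

```python
def falsy_offsets(start, stop, step):
    """Return the offsets in range(start, stop, step) for which `not offset` is True."""
    return [offset for offset in range(start, stop, step) if not offset]


# range(0, 10, 10) yields only 0, which is falsy, so the branch runs once.
print(falsy_offsets(0, 10, 10))    # [0]

# range(40, 290, 10) starts at 40; every value is non-zero (truthy),
# so the branch never runs and inf_rest_name would never be assigned.
print(falsy_offsets(40, 290, 10))  # []
```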

You could change to:

if offset == firstvalue:

e.g.

if offset == 40:
    inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
    rest_eclf = soup.select_one('.header_links a').text.strip()

See this for more info.
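If you would rather not hard-code the first offset, one option is to read it from the range object itself, since range objects support indexing. This is only a sketch: the names offsets, first_offset and first_page_hits are illustrative, and the list append stands in for the lines that read inf_rest_name and rest_eclf.

```python
offsets = range(40, 290, 10)
first_offset = offsets[0]  # range objects support indexing; here this is 40

first_page_hits = []
for offset in offsets:
    if offset == first_offset:
        # In the scraper, this is where inf_rest_name and rest_eclf would be read.
        first_page_hits.append(offset)

print(first_page_hits)  # [40]
```

That way, changing the start of the range does not require changing the comparison as well.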

Those lines also need to work with the first soup, not the later soup (which contains only the reviews).

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    for offset in range(40, 290, 10):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d9783452-Reviews-or{offset}-Boutary-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # Read the restaurant details from the first page only; they do not change.
        if offset == 40:
            inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()
        reviews = soup.select('.reviewSelector')
        ids = [review.get('data-reviewid') for review in reviews]
        # Request the expanded reviews for the ids found on this page.
        r = s.post(
            'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
            data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
            headers = {'referer': r.url}
        )

        # From here on, soup holds only the expanded reviews.
        soup = bs(r.content, 'lxml')

        for review in soup.select('.reviewSelector'):
            name_client = review.select_one('.info_text > div:first-child').text.strip()
            date_rev_cl = review.select_one('.ratingDate')['title'].strip()
            titre_rev_cl = review.select_one('.noQuotes').text.strip()
            opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
            row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]

For your first code block, you are using an invalid attribute. It should be:

ids = [review.get('data-reviewid') for review in reviews]

Note that I have added an is None test to handle elements that are not found. This should be added to the version above as well.

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    for offset in range(270, 1230, 10):
        url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d6575305-Reviews-or{offset}-Il_Etait_Un_Square-Paris_Ile_de_France.html'
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        # Read the restaurant details from the first page only; they do not change.
        if offset == 270:
            inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
            rest_eclf = soup.select_one('.header_links a').text.strip()
        reviews = soup.select('.reviewSelector')
        ids = [review.get('data-reviewid') for review in reviews]
        # Request the expanded reviews for the ids found on this page.
        r = s.post(
            'https://www.tripadvisor.fr/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer=',
            data = {'reviews': ','.join(ids), 'contextChoice': 'DETAIL'},
            headers = {'Referer': r.url}
        )

        # From here on, soup holds only the expanded reviews.
        soup = bs(r.content, 'lxml')

        for review in soup.select('.reviewSelector'):
            # Each field falls back to 'N/A' when its element is not found.
            name_client = review.select_one('.info_text > div:first-child')
            if name_client is None:
                name_client = 'N/A'
            else:
                name_client = name_client.text.strip()

            date_rev_cl = review.select_one('.ratingDate')
            if date_rev_cl is None:
                date_rev_cl = 'N/A'
            else:
                date_rev_cl = date_rev_cl['title'].strip()

            titre_rev_cl = review.select_one('.noQuotes')
            if titre_rev_cl is None:
                titre_rev_cl = 'N/A'
            else:
                titre_rev_cl = titre_rev_cl.text.strip()

            opinion_cl = review.select_one('.partial_entry')
            if opinion_cl is None:
                opinion_cl = 'N/A'
            else:
                opinion_cl = opinion_cl.text.replace("\n","").strip()

            row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}", f"{titre_rev_cl}", f"{opinion_cl}"]
            print(row)
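The repeated is None checks could be folded into one small helper. The following is only a sketch under some assumptions: the function name text_or_default and its parameters are hypothetical, the HTML snippet is invented test data rather than real TripAdvisor markup, and it uses the built-in 'html.parser' so that lxml is not needed to try it.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, selector, default='N/A', attr=None):
    """Return the stripped text (or attribute) of the first match, else `default`."""
    node = parent.select_one(selector)
    if node is None:
        return default
    if attr is not None:
        # Assumes a single-valued attribute such as title (not class).
        value = node.get(attr)
        return default if value is None else value.strip()
    return node.get_text().replace("\n", "").strip()


# Invented sample markup, for illustration only.
html = (
    '<div class="reviewSelector">'
    '<span class="ratingDate" title=" 1 juillet 2019 "></span>'
    '<span class="noQuotes"> Tres bien </span>'
    '</div>'
)
review = BeautifulSoup(html, 'html.parser').select_one('.reviewSelector')

print(text_or_default(review, '.noQuotes'))                  # Tres bien
print(text_or_default(review, '.ratingDate', attr='title'))  # 1 juillet 2019
print(text_or_default(review, '.partial_entry'))             # N/A
```

With a helper like this, each field in the loop becomes a single line, for example date_rev_cl = text_or_default(review, '.ratingDate', attr='title').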
