
Python Scrapy How To Use BaseDupeFilter

I have a website with many pages like this: mywebsite/?page=1 mywebsite/?page=2 ... ... ... mywebsite/?page=n. Each page has links to players. When you click on any link, you...

Solution 1:

I'd take another approach: instead of querying for the last player during the spider run, launch the spider with a pre-calculated argument holding the ID of the last scraped player:

scrapy crawl <myspider> -a last_player=X

Then your spider may look like this:

class MySpider(BaseSpider):
    start_urls = ["http://....mywebsite/?page=1"]
    ...

    def parse(self, response):
        ...
        last_player_met = False
        player_links = sel.xpath(....)
        for player_link in player_links:
            player_id = player_link.split(....)
            if player_id < self.last_player:
                yield Request(url=player_link, callback=self.scrape_player)
            else:
                last_player_met = True
        if not last_player_met:
            # try to xpath for 'Next' in pagination
            # or use meta={} in request to loop over pages like
            # "http://....mywebsite/?page=" + page_number
            yield Request(url=..., callback=self.parse)
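For completeness, here is a minimal sketch of how the -a option reaches the spider and of the meta={} page-looping idea from the comment above. Scrapy passes -a options to the spider constructor as string keyword arguments, so casting once in __init__ keeps the comparison in parse() numeric. The int cast, the page_number name, and the stubbed-out link-extraction logic are illustrative assumptions, not part of the original answer; mywebsite is the question's placeholder domain.

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    name = "myspider"
    start_urls = ["http://mywebsite/?page=1"]

    def __init__(self, last_player=0, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a options arrive as strings; cast once so comparisons
        # against player ids in parse() can be numeric
        self.last_player = int(last_player)

    def parse(self, response):
        # carry the current page number along in the request meta
        page_number = response.meta.get("page_number", 1)
        # ... extract player links, compare ids against
        # self.last_player, and set last_player_met as in the
        # snippet above ...
        last_player_met = False  # placeholder for that logic
        if not last_player_met:
            yield Request(
                url="http://mywebsite/?page=%d" % (page_number + 1),
                callback=self.parse,
                meta={"page_number": page_number + 1},
            )

With this in place, scrapy crawl myspider -a last_player=X starts from page 1 and stops requesting further pages once the previously scraped player X is encountered.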
