Python Scrapy: How to Use BaseDupeFilter
I have a website with many pages like this: mywebsite/?page=1, mywebsite/?page=2, ..., mywebsite/?page=n. Each page has links to players. When you click on any link, you ...
Solution 1:
I'd take another approach: instead of querying for the last player during the spider run, launch the spider with a precalculated argument holding the ID of the last scraped player:
scrapy crawl <myspider> -a last_player=X
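Note that spider arguments passed with -a always arrive as strings, so the spider should convert last_player to a number before comparing it against player IDs. A minimal Scrapy-free sketch of that conversion (the class name here is made up for illustration):

```python
class PlayerSpiderArgs:
    """Mimics how Scrapy hands -a arguments to a spider: as keyword strings."""

    def __init__(self, last_player="0"):
        # -a values are always strings; convert once up front so later
        # numeric comparisons against player IDs behave as expected
        self.last_player = int(last_player)
```

In a real spider, the same conversion would go in the spider's own __init__ (calling the parent constructor first).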
then your spider may look like:
class MySpider(BaseSpider):
    start_urls = ["http://....mywebsite/?page=1"]
    ...

    def parse(self, response):
        ...
        last_player_met = False
        player_links = sel.xpath(....)
        for player_link in player_links:
            player_id = player_link.split(....)
            if player_id < self.last_player:
                yield Request(url=player_link, callback=self.scrape_player)
            else:
                last_player_met = True
        if not last_player_met:
            # try to xpath for 'Next' in pagination
            # or use meta={} in request to loop over pages like
            # "http://....mywebsite/?page=" + page_number
            yield Request(url=..., callback=self.parse)
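The core idea above, stopping pagination once the previously scraped player is reached, can be simulated without Scrapy. A minimal sketch, assuming player IDs grow over time (so unseen players have IDs greater than last_player; flip the comparison if your site orders IDs the other way) and that the ID is the last path segment of each link — both assumptions for illustration only:

```python
def select_new_players(player_links, last_player):
    """Return (links for players newer than last_player, whether the
    previously scraped player was encountered on this page)."""
    new_links = []
    last_player_met = False
    for link in player_links:
        # assumption: numeric player ID is the last path segment,
        # e.g. "mywebsite/player/42" -> 42
        player_id = int(link.rsplit("/", 1)[-1])
        if player_id > last_player:
            new_links.append(link)
        else:
            # we have caught up with the previous run; no need to
            # request the next page after this one
            last_player_met = True
    return new_links, last_player_met
```

For example, with last_player=44, a page listing players 45, 44, and 43 yields only the link for player 45, and last_player_met is True, so the spider would stop paginating.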