Recursive Scraping Craigslist With Scrapy And Python 2.7

July 28, 2023

I'm having trouble getting the spider to follow only the next page of ads. Instead it follows every link it finds and eventually returns every Craigslist page. I've played around with the

Solution 1:

Pass an allow argument to SgmlLinkExtractor so that only matching links are extracted. From the Scrapy documentation:

allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.

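from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Inside your CrawlSpider subclass: extract and follow only links matching the pattern.
rules = (
    Rule(SgmlLinkExtractor(allow=r'http://medford\.craigslist\.org/cto/'),
         callback='parse_page', follow=True),
)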

This restricts the crawl to links whose URLs match http://medford.craigslist.org/cto/, so the spider no longer wanders off to the rest of the site.
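To get a feel for what the allow pattern lets through, here is a minimal sketch that uses only Python's re module. It is not Scrapy code: the helper name filter_links and the sample URLs are made up for illustration, and it assumes the pattern is searched for within each absolute URL, which is how I understand the link extractor to apply it.

```python
import re

# Hypothetical helper (not part of Scrapy): keep only the URLs that
# match the given `allow` regular expression.
def filter_links(urls, allow):
    pattern = re.compile(allow)
    return [url for url in urls if pattern.search(url)]

# Made-up sample links of the kind a listing page might contain.
links = [
    "http://medford.craigslist.org/cto/",
    "http://medford.craigslist.org/cto/index100.html",
    "http://medford.craigslist.org/about/help/",
    "http://portland.craigslist.org/cto/",
]

# Dots are escaped so they match a literal "." rather than any character.
allowed = filter_links(links, r"http://medford\.craigslist\.org/cto/")
```

With these sample links, only the two URLs under medford.craigslist.org/cto/ remain in allowed. The help page and the other city's listings are dropped, which is the behaviour you want from the rule.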

Hope that helps.
