
Recursive Scraping Craigslist With Scrapy And Python 2.7

I'm having trouble getting the spider to follow the next page of ads without following every link it finds and eventually returning every Craigslist page. I've played around with the

Solution 1:

You should pass an allow argument to SgmlLinkExtractor:

allow (a regular expression (or list of)) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.

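# Only extract links whose absolute URL matches the allow pattern
# (dots are escaped so they match a literal '.')
rules = (
    Rule(SgmlLinkExtractor(allow=r'http://medford\.craigslist\.org/cto/'),
         callback='parse_page', follow=True),
)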

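This restricts the spider to links whose URL matches http://medford.craigslist.org/cto/, so it follows the listing pages and ads in that section and ignores the rest of the site.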
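To see what the allow pattern does to a set of candidate links, here is a minimal sketch that uses only Python's re module. The filter_links helper and the sample URLs are made up for illustration and are not part of Scrapy. It assumes the extractor keeps a URL when any allow pattern is found in it, which is how the documentation quoted above reads.

```python
import re


def filter_links(urls, allow):
    """Keep the URLs that match at least one pattern in `allow`.

    Hypothetical helper (not a Scrapy API). It mimics the documented
    behaviour of the `allow` argument: an empty `allow` keeps every link.
    """
    if isinstance(allow, str):
        allow = [allow]
    patterns = [re.compile(p) for p in allow]
    if not patterns:
        return list(urls)
    # Assumption: a URL is kept when any pattern is found in it
    return [u for u in urls if any(p.search(u) for p in patterns)]


# Made-up sample links, as a spider might find them on a listing page
links = [
    "http://medford.craigslist.org/cto/index100.html",
    "http://medford.craigslist.org/cto/1234567890.html",
    "http://medford.craigslist.org/about/help/",
    "http://portland.craigslist.org/cto/",
]

kept = filter_links(links, r"http://medford\.craigslist\.org/cto/")
for url in kept:
    print(url)
```

Only the two links under the medford cto section survive. The help page and the link on another subdomain are dropped, which is what stops the spider from wandering across the whole site.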

Hope that helps.
