Scrapy: Program Organization When Interacting With Secondary Website
I'm working with Scrapy 1.1 and I have a project where spider '1' scrapes site A (where I acquire 90% of the information to fill my items). However, depending on the results of that scrape, I may need to pull the remaining information from a secondary site, and I'm unsure whether that logic belongs in the same spider or somewhere else.
Solution 1:
Both approaches are very common, and this is just a question of preference. For your case, containing everything in one spider sounds like a straightforward solution.
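For example, here is a minimal sketch of the single-spider approach. The site URL, selectors, and field names below are placeholders rather than anything from your project: the first callback fills most of the item from site A, then hands it to a second callback via the request meta so it can be finished with data from the secondary site.

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://site-a.example/']

    def parse(self, response):
        # Fill most of the item from site A (placeholder selector).
        item = {'name': response.css('h1::text').extract_first()}
        extra_url = response.css('a.details::attr(href)').extract_first()
        if extra_url:
            # Pass the partial item to the second callback via meta.
            yield scrapy.Request(response.urljoin(extra_url),
                                 callback=self.parse_extra,
                                 meta={'item': item})
        else:
            yield item

    def parse_extra(self, response):
        # Finish the item with data from the secondary site.
        item = response.meta['item']
        item['some_extra_stuff'] = response.css('p::text').extract_first()
        yield item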
Alternatively, you can add a URL field to your item, then schedule and parse the extra request later in a pipeline:
from scrapy import Request
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        extra_url = item.get('extra_url', None)
        if not extra_url:
            return item
        req = Request(url=extra_url,
                      callback=self.custom_callback,
                      meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        # You have to drop the item here, since the callback below
        # will send the completed item through the pipeline again.
        raise DropItem()

    def custom_callback(self, response):
        # Retrieve the partially filled item.
        item = response.meta['item']
        # Do something to add to the item.
        item['some_extra_stuff'] = ...
        del item['extra_url']
        yield item
The code above checks whether the item has an extra_url field. If it does not, the item passes through unchanged; if it does, the pipeline drops the item and schedules a new request. That request fills the item with the extra data and sends it back through the pipeline.
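Remember to enable the pipeline in your project settings. The 'myproject.pipelines' module path and the priority value 300 below are placeholders for your own project layout:

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}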