ReactorNotRestartable With Scrapy When Using Google Cloud Functions
Solution 1:
By default, the asynchronous nature of Scrapy does not work well with Cloud Functions: we need a way to block on the crawl so the function does not return early and the instance is not killed before the crawl finishes.
Instead, we can use scrapydo to run your existing spider in a blocking fashion:
requirements.txt:

    scrapydo
main.py:
    import scrapy
    import scrapydo

    scrapydo.setup()

    class MyItem(scrapy.Item):
        url = scrapy.Field()

    class MySpider(scrapy.Spider):
        name = "example.com"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        def parse(self, response):
            yield MyItem(url=response.url)

    def run_single_crawl(data, context):
        results = scrapydo.run_spider(MySpider)
This also shows a simple example of how to yield one or more scrapy.Item from the spider and collect the results from the crawl, which would also be challenging to do without scrapydo.
Also: make sure that you have billing enabled for your project. Without billing, Cloud Functions can only make outbound requests to Google services, so the crawl will appear to succeed but return no results.
Solution 2:
You can run your spiders sequentially within a single CrawlerProcess: queue all the crawls first, then start the process once. The reactor is started only once, which avoids ReactorNotRestartable.
main.py:

    from scrapy.crawler import CrawlerProcess

    def run_single_crawl(data, context):
        # MySpider1 and MySpider2 are your spider classes,
        # defined in or imported into this module.
        process = CrawlerProcess()
        process.crawl(MySpider1)
        process.crawl(MySpider2)
        process.start()  # blocks until all queued crawls have finished