Input/Output for a Scrapyd Instance Hosted on an Amazon EC2 Linux Instance
Recently I began building web scrapers with Scrapy. Originally I deployed my Scrapy projects locally using Scrapyd. The Scrapy project I built relies on accessing d
Solution 1:
Is S3 an option? I'm asking because you're already using EC2. If that's the case, you could read/write from/to S3.
I'm a bit confused because you mentioned both CSV and JSON formats. If you're reading CSV, you could use CSVFeedSpider. Either way, you could also use boto to read from S3 in your spider's __init__ or start_requests method.
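A minimal sketch of that idea, reading an input CSV from S3 inside start_requests. The bucket name, key, and column name here are assumptions, not from the original question; in a real project the class would subclass scrapy.Spider and yield scrapy.Request objects, and the boto3 client requires AWS credentials to be configured.

```python
import csv
import io


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


class S3InputSpider:
    # In a real Scrapy project this would be: class S3InputSpider(scrapy.Spider)
    name = "s3_input"

    def start_requests(self):
        # boto3 is imported lazily here so the parsing helper above
        # can be used without AWS dependencies installed.
        import boto3

        s3 = boto3.client("s3")
        # Hypothetical bucket and key names.
        obj = s3.get_object(Bucket="my-input-bucket", Key="start_urls.csv")
        body = obj["Body"].read().decode("utf-8")
        for row in rows_from_csv(body):
            # In Scrapy: yield scrapy.Request(row["url"], callback=self.parse)
            yield row["url"]
```

The same boto3 call could instead run in __init__ if you want the rows loaded once when the spider is constructed.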
Regarding the output, the Scrapy documentation on feed exports explains how to write the output of a crawl to S3.
Relevant settings:
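As a sketch, a settings.py fragment along these lines enables S3 feed export; the bucket name and path are placeholders, and the AWS keys must be your own (they can also come from the environment or an IAM role).

```python
# Hedged example of S3 feed-export settings; bucket/path are assumptions.
FEED_URI = "s3://my-output-bucket/%(name)s/%(time)s.json"
FEED_FORMAT = "json"

AWS_ACCESS_KEY_ID = "your-access-key"
AWS_SECRET_ACCESS_KEY = "your-secret-key"

# Note: newer Scrapy releases (2.1+) replace FEED_URI/FEED_FORMAT
# with a single FEEDS dict, e.g.:
# FEEDS = {"s3://my-output-bucket/items.json": {"format": "json"}}
```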