
Speed Up Web Scraper

I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job. It is, however, really slow.

Solution 1:

Here's a collection of things to try:

  • use the latest Scrapy version (if you are not already)
  • check whether any non-standard middlewares are in use
  • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs); see the settings sketch after this list
  • turn off logging with LOG_ENABLED = False (docs)
  • try yielding each item in a loop instead of collecting items into a list and returning them; see the sketch after this list
  • use a local DNS cache (see this thread)
  • check whether the site uses a download threshold that limits your download speed (see this thread)
  • log CPU and memory usage during the spider run and check for problems there
  • try running the same spider under the scrapyd service
  • see whether grequests + lxml performs better (ask if you need help implementing this solution)
  • try running Scrapy on PyPy; see Running Scrapy on PyPy
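
For the concurrency and logging items above, a minimal settings.py sketch; the values are illustrative, not tuned for any particular site:

# settings.py -- illustrative values; tune against your target site
CONCURRENT_REQUESTS = 100            # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # Scrapy's default is 8
LOG_ENABLED = False                  # skip per-request log output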
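
And for the item-yielding point, a minimal before/after sketch; the selectors are placeholders for whatever your spider actually extracts:

# Instead of collecting every item before returning ...
def parse(self, response):
    items = []
    for row in response.css("tr"):
        items.append({"text": row.css("td::text").get()})
    return items

# ... yield each item as soon as it is ready, so the engine can
# process it while other requests are still in flight:
def parse(self, response):
    for row in response.css("tr"):
        yield {"text": row.css("td::text").get()}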

Hope that helps.

Solution 2:

Looking at your code, I'd say most of that time is spent on network requests rather than on processing the responses. All of the tips @alecxe gives in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches responses and avoids fetching them a second time. That helps on subsequent crawls and even during offline development. See the docs for more info: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
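
A minimal sketch of enabling it in settings.py; the directory and expiry values shown are just the defaults, spelled out:

# settings.py -- cache responses on disk so repeat crawls skip the network
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire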

Solution 3:

One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately.

For example, if your target data lives at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:

start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]

In this way, the crawl starts at 250 and 750 simultaneously and fans out in both directions (from 250 to 251, 249, and so on, and from 750 to 751, 749, and so on), so you get roughly 4 times the speed of start_urls = ["http://apps.webofknowledge.com/doc=1"].
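
A minimal sketch of that fan-out, assuming the doc= URL pattern above; the DocSpider name and the item-extraction line are placeholders, and Scrapy's default duplicate filter keeps the four chains from re-crawling each other's pages:

import re

import scrapy

class DocSpider(scrapy.Spider):
    name = "docs"
    # Two seeds; each fans out in both directions, giving four chains.
    start_urls = [
        "http://apps.webofknowledge.com/doc=250",
        "http://apps.webofknowledge.com/doc=750",
    ]
    min_doc, max_doc = 1, 1000

    def parse(self, response):
        doc = int(re.search(r"doc=(\d+)", response.url).group(1))
        # ... extract your items from the page here ...
        # Queue the neighbouring doc numbers in both directions; the
        # built-in duplicate filter drops pages already requested.
        for neighbour in (doc - 1, doc + 1):
            if self.min_doc <= neighbour <= self.max_doc:
                yield response.follow(
                    "http://apps.webofknowledge.com/doc=%d" % neighbour,
                    callback=self.parse,
                )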

Solution 4:

I also work on web scraping, using optimized C#, and it still ends up CPU bound, so I am switching to C.

Parsing HTML blows out the CPU's data cache, and I am fairly sure your CPU is not using SSE 4.2 at all, since that feature is only accessible from C/C++.

If you do the math, you quickly become compute bound, not memory bound.
