
Speed Up Web Scraper

I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job. It is, however, really slow.

Solution 1:

Here's a collection of things to try:

  • use the latest Scrapy version (if you are not already)
  • check whether any non-standard middlewares are in use
  • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs); see the settings sketch after this list
  • turn off logging with LOG_ENABLED = False (docs)
  • try yielding each item in a loop instead of collecting items into a list and returning them; see the sketch after this list
  • use a local DNS cache (see this thread)
  • check whether the site uses a download threshold that limits your download speed (see this thread)
  • log CPU and memory usage during the spider run and check for problems there
  • try running the same spider under the scrapyd service
  • see whether grequests + lxml performs better (ask if you need help implementing this solution)
  • try running Scrapy on PyPy; see Running Scrapy on PyPy
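
For the concurrency and logging items above, a minimal settings.py sketch; the values are illustrative, not tuned for any particular site:

# settings.py -- illustrative values; tune against your target site
CONCURRENT_REQUESTS = 100            # Scrapy's default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 32  # Scrapy's default is 8
LOG_ENABLED = False                  # skip per-request log output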
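
And for the item-yielding point, a minimal before/after sketch; the selectors are placeholders for whatever your spider actually extracts:

# Instead of collecting every item before returning ...
def parse(self, response):
    items = []
    for row in response.css("tr"):
        items.append({"text": row.css("td::text").get()})
    return items

# ... yield each item as soon as it is ready, so the engine can
# process it while other requests are still in flight:
def parse(self, response):
    for row in response.css("tr"):
        yield {"text": row.css("td::text").get()}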

Hope that helps.

Solution 2:

Looking at your code, I'd say most of that time is spent on network requests rather than on processing the responses. All of the tips @alecxe gives in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches responses and avoids fetching them a second time. That helps on subsequent crawls and even during offline development. See the docs for more info: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
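
A minimal sketch of enabling it in settings.py; the directory and expiry values shown are just the defaults, spelled out:

# settings.py -- cache responses on disk so repeat crawls skip the network
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire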

Solution 3:

One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately.

For example, if your target data lives at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure your start_urls as follows:

start_urls = [
    "http://apps.webofknowledge.com/doc=250",
    "http://apps.webofknowledge.com/doc=750",
]

In this way, the crawl starts at 250 and 750 simultaneously and fans out in both directions (from 250 to 251, 249, and so on, and from 750 to 751, 749, and so on), so you get roughly 4 times the speed of start_urls = ["http://apps.webofknowledge.com/doc=1"].
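
A minimal sketch of that fan-out, assuming the doc= URL pattern above; the DocSpider name and the item-extraction line are placeholders, and Scrapy's default duplicate filter keeps the four chains from re-crawling each other's pages:

import re

import scrapy

class DocSpider(scrapy.Spider):
    name = "docs"
    # Two seeds; each fans out in both directions, giving four chains.
    start_urls = [
        "http://apps.webofknowledge.com/doc=250",
        "http://apps.webofknowledge.com/doc=750",
    ]
    min_doc, max_doc = 1, 1000

    def parse(self, response):
        doc = int(re.search(r"doc=(\d+)", response.url).group(1))
        # ... extract your items from the page here ...
        # Queue the neighbouring doc numbers in both directions; the
        # built-in duplicate filter drops pages already requested.
        for neighbour in (doc - 1, doc + 1):
            if self.min_doc <= neighbour <= self.max_doc:
                yield response.follow(
                    "http://apps.webofknowledge.com/doc=%d" % neighbour,
                    callback=self.parse,
                )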

Solution 4:

I also work on web scraping, using optimized C#, and it still ends up CPU bound, so I am switching to C.

Parsing HTML blows out the CPU's data cache, and I am fairly sure your CPU is not using SSE 4.2 at all, since that feature is only accessible from C/C++.

If you do the math, you quickly become compute bound, not memory bound.
