Speed Up Web Scraper
Solution 1:
Here's a collection of things to try:
- use the latest Scrapy version (if you are not using it already)
- check whether any non-standard middlewares are used
- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs)
- turn off logging with LOG_ENABLED = False (docs)
- try yielding each item in a loop instead of collecting items into the items list and returning them
- use a local DNS cache (see this thread)
- check whether the site throttles downloads and limits your download speed (see this thread)
- log CPU and memory usage during the spider run and see if there are any problems there
- try running the same spider under the scrapyd service
- see if grequests + lxml performs better (ask if you need any help with implementing this solution)
- try running Scrapy on PyPy, see Running Scrapy on PyPy
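As a rough sketch of the settings and the yield tip from the list above: the first part is what a settings.py fragment might look like (the numbers are illustrative, not tuned recommendations), and the second part uses plain Python to show the difference between returning a list and yielding. The names parse_collect and parse_yield and the sample rows are made up for illustration; in a real spider the same pattern would live in the parse callback.

```python
# settings.py fragment (illustrative values, tune them for the target site)
CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS_PER_DOMAIN = 16
LOG_ENABLED = False


# Hypothetical stand-ins for a spider callback, written without Scrapy
# so the difference between the two styles is easy to see.
def parse_collect(rows):
    """Build the whole list first; nothing is handed on until the loop ends."""
    items = []
    for row in rows:
        items.append({"title": row.strip()})
    return items


def parse_yield(rows):
    """Hand each item on as soon as it is built."""
    for row in rows:
        yield {"title": row.strip()}


rows = [" first ", " second "]  # made-up data standing in for response rows
```

Both functions produce the same items; the generator version simply lets the consumer start processing before the loop has finished.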
Hope that helps.
Solution 2:
Looking at your code, I'd say most of the time is spent on network requests rather than on processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting: it caches responses, so the same request is not downloaded a second time. That helps on subsequent crawls and even with offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
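A minimal sketch of enabling the cache in settings.py could look like the following. Only HTTPCACHE_ENABLED comes from the answer; the expiration and directory values are assumptions chosen for a development setup.

```python
# settings.py fragment: turn on Scrapy's HTTP cache
HTTPCACHE_ENABLED = True

# Assumed values for local development, not part of the original answer:
HTTPCACHE_EXPIRATION_SECS = 0  # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"    # stored under the project's .scrapy directory
```

With a cache that never expires, remember to clear the cache directory when you need fresh pages.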
Solution 3:
One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately.
For example, if the target data is at http://apps.webofknowledge.com/doc=1, where the doc number ranges from 1 to 1000, you can configure start_urls as follows:
start_urls = [
"http://apps.webofknowledge.com/doc=250",
"http://apps.webofknowledge.com/doc=750",
]
This way, requests start at 250 and move on to 251 and 249, and start at 750 and move on to 751 and 749, all simultaneously, so the crawl can be up to 4 times faster than with start_urls = ["http://apps.webofknowledge.com/doc=1"].
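If you want more than two starting points, the same idea can be generated rather than typed by hand. This is only a sketch: seed_urls is a hypothetical helper, not part of Scrapy, and it assumes the doc numbers are contiguous and that the spider walks outward in both directions from each seed.

```python
def seed_urls(template, first, last, seeds):
    """Return one URL at the middle of each of `seeds` equal slices of first..last."""
    span = last - first + 1
    if seeds < 1 or seeds > span:
        raise ValueError("seeds must be between 1 and the size of the range")
    step = span // seeds
    # Midpoint of slice i, shifted by one so that 1..1000 with 2 seeds gives 250 and 750
    return [template.format(first + i * step + step // 2 - 1) for i in range(seeds)]


start_urls = seed_urls("http://apps.webofknowledge.com/doc={}", 1, 1000, 2)
```

Raising the number of seeds only helps up to the concurrency limits set in the Scrapy settings.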
Solution 4:
I also work on web scraping, using optimized C#, and it ends up CPU bound, so I am switching to C.
Parsing HTML blows the CPU data cache, and I'm pretty sure your CPU is not using SSE 4.2 at all, as you can only access this feature from C/C++.
If you do the math, you quickly become compute bound rather than memory bound.
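To check whether parsing is really the bottleneck in your own case, you can measure the CPU time spent per page and compare it with the time a request takes. The sketch below uses only the standard library; LinkCounter, parse_cpu_seconds and the synthetic page are made up for illustration, and a real spider would measure its own parser (for example lxml) instead.

```python
import time
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Toy parser that counts <a> tags, standing in for real extraction code."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def parse_cpu_seconds(html, repeats=100):
    """Return (average CPU seconds per parse, links found in one parse)."""
    start = time.process_time()
    for _ in range(repeats):
        parser = LinkCounter()
        parser.feed(html)
        parser.close()
    return (time.process_time() - start) / repeats, parser.links


# Synthetic page, used only so the sketch runs without network access
sample = "<html><body>" + "<a href='/doc'>doc</a>" * 1000 + "</body></html>"
cpu_per_page, links = parse_cpu_seconds(sample)
```

If cpu_per_page is small compared with the average download time per page, the crawl is network bound and the concurrency and caching tips matter more than a faster parser.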