Python Scrapy On Offline (local) Data
I have a 270MB dataset (10000 html files) on my computer. Can I use Scrapy to crawl this dataset locally? How?
Solution 1:
SimpleHTTP Server Hosting
If you truly want to host it locally and use Scrapy, you can serve it by navigating to the directory it's stored in and running SimpleHTTPServer (port 8000 shown below):
python -m SimpleHTTPServer 8000 # Python 2; on Python 3 use: python -m http.server 8000
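Then point your spider at http://127.0.0.1:8000 (for example, through its start_urls) and run it. Note that scrapy crawl takes a spider name, not a URL:
$ scrapy crawl myspider # "myspider" is a placeholder for your spider's name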
file://
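An alternative is to skip the web server and have Scrapy read the files directly through file:// URLs. Since scrapy crawl takes a spider name rather than a URL, put the file:// URLs in your spider's start_urls, one per file:
file:///home/sagi/html_files/<filename>.html # Assuming you're on a *nix system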
Wrapping up
Once you've set up your scraper for Scrapy (see the example dirbot project), just run the crawler by its spider name:
$ scrapy crawl myspider # "myspider" is a placeholder for your spider's name
If the links in the HTML files are absolute rather than relative, though, following them may not work well, because they point away from your local copy. You would need to adjust the links in the files yourself.
Solution 2:
Go to your dataset folder and run:
import os

# Loop over every entry in the current working directory.
for name in os.listdir(os.getcwd()):
    if not os.path.isfile(name):
        continue  # skip subdirectories
    with open(name, "r") as f:
        page_content = f.read()
    # Do whatever you want with page_content here, e.g. parse it with lxml or Beautiful Soup.
No need to go for Scrapy!
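To make the "do whatever you want with page_content" step a bit more concrete, here is a minimal sketch that pulls the page title out of each file's contents. It uses the standard library's html.parser rather than lxml or Beautiful Soup, purely so that it runs without extra installs; the names TitleParser and extract_title are made up for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None  # stays None if the page has no <title>

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title(page_content):
    """Return the stripped <title> text of an HTML string, or None."""
    parser = TitleParser()
    parser.feed(page_content)
    parser.close()
    return parser.title.strip() if parser.title is not None else None
```

Inside the loop above, this could be called as extract_title(page_content). For anything beyond simple extraction, lxml or Beautiful Soup will likely be more convenient.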