Missing Scheme In Request Url
Solution 1:
Change start_urls to:
self.start_urls = ["http://www.bankofwow.com/"]
Solution 2:
Prepend the URL with 'http://' or 'https://'.
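A minimal sketch of that idea, using only the Python 3 standard library. The helper name ensure_scheme and the default of "http" are assumptions made for illustration; neither is part of Scrapy:

```python
import re

# Matches a leading "scheme://" such as "http://" or "https://".
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(url, default_scheme="http"):
    """Return url unchanged if it starts with a scheme, else prepend one.

    ensure_scheme is a hypothetical helper, not a Scrapy API.
    """
    if _SCHEME_RE.match(url):
        return url
    # Drop leading slashes so "//host/" does not become "http:////host/".
    return "{}://{}".format(default_scheme, url.lstrip("/"))


print(ensure_scheme("www.bankofwow.com/"))
print(ensure_scheme("https://www.bankofwow.com/"))
```

Whether "http" or "https" is the right default depends on the site being crawled.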
Solution 3:
As @Guy answered earlier, the start_urls attribute must be a list. The exceptions.ValueError: Missing scheme in request url: h message comes from that: the "h" in the error message is the first character of "http://www.bankofwow.com/", which was given as a plain string and therefore interpreted as a list (of characters).
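The effect can be reproduced without Scrapy. The sketch below only mimics the iteration over start_urls; first_start_url is a hypothetical helper, not Scrapy code:

```python
def first_start_url(start_urls):
    """Return the first item obtained by iterating over start_urls."""
    for url in start_urls:
        return url
    return None


as_string = "http://www.bankofwow.com/"
as_list = ["http://www.bankofwow.com/"]
not_a_tuple = ("http://www.bankofwow.com/")  # parentheses alone: still a str
as_tuple = ("http://www.bankofwow.com/",)    # the trailing comma makes a tuple

print(first_start_url(as_string))    # a single character
print(first_start_url(as_list))      # the whole URL
print(first_start_url(not_a_tuple))  # a single character again
print(first_start_url(as_tuple))     # the whole URL
```

A list or a one-element tuple (with its trailing comma) both yield the full URL; a bare string yields one character at a time.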
allowed_domains must also be a list of domains; otherwise your requests will be filtered as "offsite".
Change restrict_xpaths to
self.xpaths = """//td[@class="CatBg" and @width="25%"
                 and @valign="top" and @align="center"]
                 /table[@cellspacing="0"]//tr/td"""
It should represent an area in the document in which to find links; it should not select the link URLs directly.
From http://doc.scrapy.org/en/latest/topics/link-extractors.html#sgmllinkextractor
restrict_xpaths (str or list) – is a XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links.
Finally, it's customary to define these as class attributes instead of setting them in __init__:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request
from bow.items import BowItem
import sys
import MySQLdb
import hashlib
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class bankOfWow_spider(CrawlSpider):
    name = "bankofwow"
    allowed_domains = ["bankofwow.com"]
    start_urls = ["http://www.bankofwow.com/"]
    # Region of the page in which to look for links (not the links themselves)
    xpaths = '''//td[@class="CatBg" and @width="25%"
                and @valign="top" and @align="center"]
                /table[@cellspacing="0"]//tr/td'''
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=(xpaths,))),
        # allow takes regular expressions, so "." and "?" are escaped
        Rule(SgmlLinkExtractor(allow=(r'cart\.php\?',)), callback='parse_items'),
    )

    def __init__(self, *a, **kw):
        # catch the spider stopping
        # dispatcher.connect(self.spider_closed, signals.spider_closed)
        # dispatcher.connect(self.on_engine_stopped, signals.engine_stopped)
        super(bankOfWow_spider, self).__init__(*a, **kw)

    def parse_items(self, response):
        sel = Selector(response)
        items = []
        listings = sel.xpath('//*[@id="tabContent"]/table/tr')
        item = BowItem()  # BowItem is the item class imported above
        item["header"] = sel.xpath('//td[@valign="center"]/h1/text()')
        items.append(item)
        return items
Solution 4:
A URL with a scheme basically has a syntax like
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Examples of popular schemes include http(s), ftp, mailto, file, data, and irc. There are also terms like about or about:blank that we are somewhat familiar with.
It's more clear in the description on that same definition page:
                    hierarchical part
        ┌───────────────────┴─────────────────────┐
                    authority               path
        ┌───────────────┴───────────────┐┌───┴────┐
  abc://username:password@example.com:123/path/data?key=value&key2=value2#fragid1
  └┬┘   └───────┬───────┘ └────┬────┘ └┬┘           └─────────┬─────────┘ └──┬──┘
scheme  user information      host    port                  query         fragment

  urn:example:mammal:monotreme:echidna
  └┬┘ └──────────────┬───────────────┘
scheme              path
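The same decomposition can be checked with urlsplit from the Python 3 standard library. This is only an illustrative sketch; the variable names are made up for the example:

```python
from urllib.parse import urlsplit

# The two example URLs from the diagram above.
parts = urlsplit(
    "abc://username:password@example.com:123/path/data"
    "?key=value&key2=value2#fragid1"
)
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
print(parts.username, parts.password, parts.hostname, parts.port)

urn = urlsplit("urn:example:mammal:monotreme:echidna")
print(urn.scheme, urn.path)

# A relative URL like the one in the question has neither scheme nor authority.
relative = urlsplit("cart.php?target=category&category_id=826")
print(repr(relative.scheme), repr(relative.netloc), relative.path)
```

An empty scheme in the last result is the condition that the "Missing scheme" error is complaining about.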
In the question about the missing scheme, it appears that the scheme and the [//[user:password@]host[:port]] part are missing in
data=u'cart.php?target=category&category_id=826'
as mentioned above.
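One way to fill in the missing parts is to resolve the relative URL against the URL of the page it was found on. A minimal sketch with the Python 3 standard library; the base URL used here is an assumption (inside a Scrapy callback it would typically be response.url):

```python
from urllib.parse import urljoin

# Assumed base: the page on which the relative link was found.
base = "http://www.bankofwow.com/"
relative = "cart.php?target=category&category_id=826"

# urljoin borrows the scheme and host from base.
absolute = urljoin(base, relative)
print(absolute)
```

The resulting absolute URL carries a scheme and host, so it can be used to build a request.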
I had a similar problem, and this simple concept was enough to solve it.
Hope this helps.
Solution 5:
Change start_urls to:
self.start_urls = ("http://www.domainname.com/",)
It should work.