1
qsl0913 2014-09-01 14:55:26 +08:00
I recommend approach one. Debug it and check whether the middleware is actually being executed, and why it isn't taking effect.
Here is a middleware that picks a random proxy from a pool of proxies; change it to inspect `request.url` and you can use a different proxy for different domains.

settings.py:

```python
DOWNLOADER_MIDDLEWARES = {
    'dp.spiders.downloadermiddleware.ProxyMiddleware': 400,
}
```

downloadermiddleware.py:

```python
import random

from scrapy import log
from scrapy.conf import settings


class ProxyMiddleware(object):
    # Pool of proxy URIs, read from the PROXY_LIST setting
    proxy_list = settings.get('PROXY_LIST')

    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        if proxy:
            request.meta['proxy'] = proxy
            log.msg('Current proxy: ' + proxy, level='INFO')
```
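To illustrate the "check `request.url`" idea, here is a minimal sketch of per-domain proxy selection (Python 3; `DOMAIN_PROXY_MAP`, `DEFAULT_PROXIES`, and the proxy addresses are made-up names for the example, not Scrapy settings):

```python
# Sketch: pick a proxy based on the request's domain instead of at random.
# DOMAIN_PROXY_MAP and DEFAULT_PROXIES are hypothetical, define your own.
import random
from urllib.parse import urlparse

DOMAIN_PROXY_MAP = {
    'example.com': ['http://proxy-a:8080'],
    'example.org': ['http://proxy-b:8080', 'http://proxy-c:8080'],
}
DEFAULT_PROXIES = ['http://proxy-d:8080']


def choose_proxy(url):
    """Return a proxy for the url's domain, falling back to the default pool."""
    domain = urlparse(url).hostname
    pool = DOMAIN_PROXY_MAP.get(domain, DEFAULT_PROXIES)
    return random.choice(pool)


class DomainProxyMiddleware(object):
    def process_request(self, request, spider):
        # Same hook as ProxyMiddleware above, but keyed on the domain
        request.meta['proxy'] = choose_proxy(request.url)
```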
2
forever139 2014-09-17 17:27:15 +08:00
Actually the author already provides one:
```python
import base64
from urllib import unquote
from urllib2 import _parse_proxy
from urlparse import urlunparse


class SelectiveProxyMiddleware(object):
    """A middleware to enable http proxy to selected spiders only.

    Settings:
        HTTP_PROXY -- proxy uri. e.g.: http://user:pass@host:port
        PROXY_SPIDERS -- all requests from these spiders will be routed
                         through the proxy
    """

    def __init__(self, settings):
        self.proxy = self.parse_proxy(settings.get('HTTP_PROXY'), 'http')
        self.proxy_spiders = set(settings.getlist('PROXY_SPIDERS', []))

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def parse_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))
        if user and password:
            user_pass = '%s:%s' % (unquote(user), unquote(password))
            creds = base64.b64encode(user_pass).strip()
        else:
            creds = None
        return creds, proxy_url

    def process_request(self, request, spider):
        if spider.name in self.proxy_spiders:
            creds, proxy = self.proxy
            request.meta['proxy'] = proxy
            if creds:
                request.headers['Proxy-Authorization'] = 'Basic ' + creds
```

Then configure in your settings.py:

```python
HTTP_PROXY = 'your_proxy'
PROXY_SPIDERS = [your_spider_names]
```

Overall it's the same idea as yours, though.
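One thing the reply leaves implicit: the middleware still has to be registered in `DOWNLOADER_MIDDLEWARES` or it will never run. A settings.py sketch, where `myproject.middlewares` is a placeholder for wherever you put the class:

```python
# settings.py -- sketch; 'myproject.middlewares' is a hypothetical module path
HTTP_PROXY = 'http://user:pass@host:port'  # your proxy URI
PROXY_SPIDERS = ['my_spider']              # names of spiders to route via proxy

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SelectiveProxyMiddleware': 400,
}
```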