十年网站开发经验 + 多家企业客户 + 靠谱的建站团队
量身定制 + 运营维护+专业推广+无忧售后,网站问题一站解决
这篇文章将为大家详细讲解有关在scrapy中使用selenium实现一个爬取网页的功能,文章内容质量较高,因此小编分享给大家做个参考,希望大家阅读完这篇文章后对相关知识有一定的了解。
姜堰ssl适用于网站、小程序/APP、API接口等需要进行数据传输应用场景,ssl证书未来市场广阔!成为成都创新互联的ssl证书销售渠道,可以享受市场价格4-6折优惠!如果有意向欢迎电话联系或者加微信:028-86922220(备注:SSL证书合作)期待与您的合作!1.背景
2. 环境
3.原理分析
3.1. 分析request请求的流程
首先看一下scrapy新的架构图:
部分流程:
第一:爬虫引擎生成requests请求,送往scheduler调度模块,进入等待队列,等待调度。
第二:scheduler模块开始调度这些requests,出队,发往爬虫引擎。
第三:爬虫引擎将这些requests送到下载中间件(多个,例如加header,代理,自定义等等)进行处理。
第四:处理完之后,送往Downloader模块进行下载。从这个处理过程来看,突破口就在下载中间件部分,用selenium直接处理掉request请求。
3.2. requests和response中间处理件源码分析
相关代码位置:
源码解析:
# 文件:E:\Miniconda\Lib\site-packages\scrapy\core\downloader\middleware.py """ Downloader Middleware manager See documentation in docs/topics/downloader-middleware.rst """ import six from twisted.internet import defer from scrapy.http import Request, Response from scrapy.middleware import MiddlewareManager from scrapy.utils.defer import mustbe_deferred from scrapy.utils.conf import build_component_list class DownloaderMiddlewareManager(MiddlewareManager): component_name = 'downloader middleware' @classmethod def _get_mwlist_from_settings(cls, settings): # 从settings.py或这custom_setting中拿到自定义的Middleware中间件 ''' 'DOWNLOADER_MIDDLEWARES': { 'mySpider.middlewares.ProxiesMiddleware': 400, # SeleniumMiddleware 'mySpider.middlewares.SeleniumMiddleware': 543, 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, }, ''' return build_component_list( settings.getwithbase('DOWNLOADER_MIDDLEWARES')) # 将所有自定义Middleware中间件的处理函数添加到对应的methods列表中 def _add_middleware(self, mw): if hasattr(mw, 'process_request'): self.methods['process_request'].append(mw.process_request) if hasattr(mw, 'process_response'): self.methods['process_response'].insert(0, mw.process_response) if hasattr(mw, 'process_exception'): self.methods['process_exception'].insert(0, mw.process_exception) # 整个下载流程 def download(self, download_func, request, spider): @defer.inlineCallbacks def process_request(request): # 处理request请求,依次经过各个自定义Middleware中间件的process_request方法,前面有加入到list中 for method in self.methods['process_request']: response = yield method(request=request, spider=spider) assert response is None or isinstance(response, (Response, Request)), \ 'Middleware %s.process_request must return None, Response or Request, got %s' % \ (six.get_method_self(method).__class__.__name__, response.__class__.__name__) # 这是关键地方 # 如果在某个Middleware中间件的process_request中处理完之后,生成了一个response对象 # 那么会直接将这个response return 出去,跳出循环,不再处理其他的process_request # 之前我们的header,proxy中间件,都只是加个user-agent,加个proxy,并不做任何return值 # 还需要注意一点:就是这个return的必须是Response对象 # 后面我们构造的HtmlResponse正是Response的子类对象 if response: defer.returnValue(response) # 如果在上面的所有process_request中,都没有返回任何Response对象的话 # 最后,会将这个加工过的Request送往download_func,进行下载,返回的就是一个Response对象 # 然后依次经过各个Middleware中间件的process_response方法进行加工,如下 defer.returnValue((yield download_func(request=request,spider=spider))) @defer.inlineCallbacks def process_response(response): assert response is not None, 'Received None in process_response' if isinstance(response, Request): defer.returnValue(response) for method in self.methods['process_response']: response = yield method(request=request, response=response, spider=spider) assert isinstance(response, (Response, Request)), \ 'Middleware %s.process_response must return Response or Request, got %s' % \ (six.get_method_self(method).__class__.__name__, type(response)) if isinstance(response, Request): defer.returnValue(response) defer.returnValue(response) @defer.inlineCallbacks def process_exception(_failure): exception = _failure.value for method in self.methods['process_exception']: response = yield method(request=request, exception=exception, spider=spider) assert response is None or isinstance(response, (Response, Request)), \ 'Middleware %s.process_exception must return None, Response or Request, got %s' % \ (six.get_method_self(method).__class__.__name__, type(response)) if response: defer.returnValue(response) defer.returnValue(_failure) deferred = mustbe_deferred(process_request, request) deferred.addErrback(process_exception) deferred.addCallback(process_response) return deferred