笔记-scrapy-请求-下载-结果处理流程
笔记-scrapy-请求-下载-结果处理流程
在使用时发现对scrpy的下载过程中的处理逻辑还是不太明晰,-写个文档温习一下。
1. 请求-下载-结果处理流程
从哪开始呢?
engine.py
@defer.inlineCallbacks
def open_spider(self, spider, start_requests=(), close_if_idle=True):
assert self.has_capacity(), "No free spider slot when opening %r" % \
spider.name
logger.info("Spider opened", extra={'spider': spider})
nextcall = CallLaterOnce(self._next_request, spider)
scheduler = self.scheduler_cls.from_crawler(self.crawler)
start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)
slot = Slot(start_requests, close_if_idle, nextcall, scheduler)
self.slot = slot
self.spider = spider
yield scheduler.open(spider)
yield self.scraper.open_spider(spider)
self.crawler.stats.open_spider(spider)
yield self.signals.send_catch_log_deferred(signals.spider_opened, spider=spider)
slot.nextcall.schedule() #
slot.heartbeat.start(5)
注意最后两句
nextcall 是自己写的一个reactor调用中间类,在这里实际是把self._next_request加入了reactor task队列。
slot.nextcall.schedule() 等于调用self._next_request
slot.heartbeat.start(5)声明每5秒调用一次nextcall.schedule
reactor是跑起来了
看到self._next_request
控制循环
def _next_request(self, spider):
slot = self.slot
if not slot:
return
if self.paused:
return
while not self._needs_backout(spider):
if not self._next_request_from_scheduler(spider):
break
if slot.start_requests and not self._needs_backout(spider):
try:
request = next(slot.start_requests)
except StopIteration:
slot.start_requests = None
except Exception:
slot.start_requests = None
logger.error('Error while obtaining start requests',
exc_info=True, extra={'spider': spider})
else:
self.crawl(request, spider)
if self.spider_is_idle(spider) and slot.close_if_idle:
self._spider_idle(spider)
_needs_backout是用于判断爬虫状态的函数
def _next_request_from_scheduler(self, spider):
slot = self.slot
request = slot.scheduler.next_request()
if not request:
return
d = self._download(request, spider)
d.addBoth(self._handle_downloader_output, request, spider)
d.addErrback(lambda f: logger.info('Error while handling downloader output',
exc_info=failure_to_exc_info(f),
extra={'spider': spider}))
d.addBoth(lambda _: slot.remove_request(request))
d.addErrback(lambda f: logger.info('Error while removing request from slot',
exc_info=failure_to_exc_info(f),
extra={'spider': spider}))
d.addBoth(lambda _: slot.nextcall.schedule())
d.addErrback(lambda f: logger.info('Error while scheduling new request',
exc_info=failure_to_exc_info(f),
extra={'spider': spider}))
return d
d 是一个defer对象,然后为它添加了一堆回调函数,包括后续处理,从请求队列中删除,取下一个请求(执行nextcall.schedule(),也就是self._next_request())
Slot是当前请求的状态保存类;
下面要分两条线了,一条是下载_download(单列章节),一条是下载返回结果处理。
下载结果处理:
def _handle_downloader_output(self, response, request, spider):
assert isinstance(response, (Request, Response, Failure)), response
# downloader middleware can return requests (for example, redirects)
if isinstance(response, Request):
self.crawl(response, spider)
return
# response is a Response or Failure
d = self.scraper.enqueue_scrape(response, request, spider)
d.addErrback(lambda f: logger.error('Error while enqueuing downloader output',
exc_info=failure_to_exc_info(f),
extra={'spider': spider}))
return d
继续d = self.scraper.enqueue_scrape(response, request, spider)
def enqueue_scrape(self, response, request, spider):
slot = self.slot
dfd = slot.add_response_request(response, request)
def finish_scraping(_):
slot.finish_response(response, request) # slot状态更新
self._check_if_closing(spider, slot)
self._scrape_next(spider, slot) # 下一步
return _
dfd.addBoth(finish_scraping)
dfd.addErrback(
lambda f: logger.error('Scraper bug processing %(request)s',
{'request': request},
exc_info=failure_to_exc_info(f),
extra={'spider': spider}))
self._scrape_next(spider, slot) #下一步
return dfd
照例:slot用于保存scraper当前任务队列及状态
继续:
def _scrape_next(self, spider, slot):
while slot.queue:
response, request, deferred = slot.next_response_request_deferred()
self._scrape(response, request, spider).chainDeferred(deferred)
def _scrape(self, response, request, spider):
"""Handle the downloaded response or failure through the spider
callback/errback"""
assert isinstance(response, (Response, Failure))
dfd = self._scrape2(response, request, spider) # returns spiders processed output
dfd.addErrback(self.handle_spider_error, request, response, spider)
dfd.addCallback(self.handle_spider_output, request, response, spider)
return dfd
添加回调函数dfd.addCallback(self.handle_spider_output, request, response, spider)
注意:dfd = self._scrape2(response, request, spider) # returns spiders processed output会完成添加spider解析的回调函数。
def call_spider(self, result, request, spider):
result.request = request
dfd = defer_result(result)
dfd.addCallbacks(request.callback or spider.parse, request.errback)
return dfd.addCallback(iterate_spider_output)
关键句:dfd.addCallbacks(request.callback or spider.parse, request.errback)
继续看self.handle_spider_output,
def handle_spider_output(self, result, request, response, spider):
if not result:
return defer_succeed(None)
it = iter_errback(result, self.handle_spider_error, request, response, spider)
dfd = parallel(it, self.concurrent_items,
self._process_spidermw_output, request, response, spider)
return dfd
def _process_spidermw_output(self, output, request, response, spider):
"""Process each Request/Item (given in the output parameter) returned
from the given spider
"""
if isinstance(output, Request):
self.crawler.engine.crawl(request=output, spider=spider)
elif isinstance(output, (BaseItem, dict)):
self.slot.itemproc_size += 1
dfd = self.itemproc.process_item(output, spider)
dfd.addBoth(self._itemproc_finished, output, response, spider)
return dfd
elif output is None:
pass
else:
typename = type(output).__name__
logger.error('Spider must return Request, BaseItem, dict or None, '
'got %(typename)r in %(request)s',
{'request': request, 'typename': typename},
extra={'spider': spider})
如果返回的是请求,crawl()
如果是item或dict,self.itemproc.process_item(output, spider);也就是pipeline中写的process_item方法了
1.1. 小结
代码中对功能函数的拆分,调用还是有一点复杂的,但这样做的好处是代码块之间的耦合性不高,可以非常方便的进行某一函数或功能块的替换。
2. 下载器
下载流程解析
从engine.py中开始:
def _download(self, request, spider):
slot = self.slot
# slot 状态更新
slot.add_request(request)
def _on_success(response):
assert isinstance(response, (Response, Request))
if isinstance(response, Response):
response.request = request # tie request to response received
logkws = self.logformatter.crawled(request, response, spider)
logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
self.signals.send_catch_log(signal=signals.response_received, \
response=response, request=request, spider=spider)
return response
def _on_complete(_):
slot.nextcall.schedule()
return _
dwld = self.downloader.fetch(request, spider)
dwld.addCallbacks(_on_success)
dwld.addBoth(_on_complete)
return dwld
主要是声明了一个deffer并添加了一些回调函数用于状态更新,信号传递。
dwld = self.downloader.fetch(request, spider)
进入downloader.fetch()
def fetch(self, request, spider):
def _deactivate(response):
self.active.remove(request)
return response
self.active.add(request)
dfd = self.middleware.download(self._enqueue_request, request, spider)
return dfd.addBoth(_deactivate)
active用于保存当前下载的任务
添加deactive用于在下载完成后删除active中对应记录
继续,middleware.download()
def download(self, download_func, request, spider):
@defer.inlineCallbacks
def process_request(request):
for method in self.methods['process_request']:
response = yield method(request=request, spider=spider)
assert response is None or isinstance(response, (Response, Request)), \
'Middleware %s.process_request must return None, Response or Request, got %s' % \
(six.get_method_self(method).__class__.__name__, response.__class__.__name__)
if response:
defer.returnValue(response)
defer.returnValue((yield download_func(request=request,spider=spider)))
@defer.inlineCallbacks
def process_response(response):
assert response is not None, 'Received None in process_response'
if isinstance(response, Request):
defer.returnValue(response)
for method in self.methods['process_response']:
response = yield method(request=request, response=response,
spider=spider)
assert isinstance(response, (Response, Request)), \
'Middleware %s.process_response must return Response or Request, got %s' % \
(six.get_method_self(method).__class__.__name__, type(response))
if isinstance(response, Request):
defer.returnValue(response)
defer.returnValue(response)
@defer.inlineCallbacks
def process_exception(_failure):
exception = _failure.value
for method in self.methods['process_exception']:
response = yield method(request=request, exception=exception,
spider=spider)
assert response is None or isinstance(response, (Response, Request)), \
'Middleware %s.process_exception must return None, Response or Request, got %s' % \
(six.get_method_self(method).__class__.__name__, type(response))
if response:
defer.returnValue(response)
defer.returnValue(_failure)
deferred = mustbe_deferred(process_request, request)
deferred.addErrback(process_exception)
deferred.addCallback(process_response)
return deferred
这里完成了下载中间件的处理
然后调用defer.returnValue((yield download_func(request=request,spider=spider)))
实际上就是download的self._enqueue_request
向下走
def _enqueue_request(self, request, spider):
key, slot = self._get_slot(request, spider)
request.meta['download_slot'] = key
def _deactivate(response):
slot.active.remove(request)
return response
slot.active.add(request)
self.signals.send_catch_log(signal=signals.request_reached_downloader,
request=request,
spider=spider)
deferred = defer.Deferred().addBoth(_deactivate)
slot.queue.append((request, deferred))
self._process_queue(spider, slot)
return deferred
老规矩,slot保存任务信息
调用self._process_queue
下载延迟是在这里进行的。
def _process_queue(self, spider, slot):
if slot.latercall and slot.latercall.active():
return
# Delay queue processing if a download_delay is configured
now = time()
delay = slot.download_delay()
if delay:
penalty = delay - now + slot.lastseen
if penalty > 0:
slot.latercall = reactor.callLater(penalty, self._process_queue, spider, slot)
return
# Process enqueued requests if there are free slots to transfer for this slot
while slot.queue and slot.free_transfer_slots() > 0:
slot.lastseen = now
request, deferred = slot.queue.popleft()
dfd = self._download(slot, request, spider)
dfd.chainDeferred(deferred)
# prevent burst if inter-request delays were configured
if delay:
self._process_queue(spider, slot)
break
下载语句_download()
def _download(self, slot, request, spider):
# The order is very important for the following deferreds. Do not change!
# 1. Create the download deferred
dfd = mustbe_deferred(self.handlers.download_request, request, spider)
# 2. Notify response_downloaded listeners about the recent download
# before querying queue for next request
def _downloaded(response):
self.signals.send_catch_log(signal=signals.response_downloaded,
response=response,
request=request,
spider=spider)
return response
dfd.addCallback(_downloaded)
# 3. After response arrives, remove the request from transferring
# state to free up the transferring slot so it can be used by the
# following requests (perhaps those which came from the downloader
# middleware itself)
slot.transferring.add(request)
def finish_transferring(_):
slot.transferring.remove(request)
self._process_queue(spider, slot)
return _
return dfd.addBoth(finish_transferring)
在这里,也维护了一个下载队列,可根据配置达到延迟下载的要求。真正发起下载请求的是调用了self.handlers.download_request:
后面的水有点深了,涉及的知识点比较多,以后有机会再写。
每个下载handler可以理解为requests包,输入url输出response就可以了。
笔记-scrapy-请求-下载-结果处理流程的更多相关文章
- 一、scrapy的下载安装---Windows(安装软件太让我伤心了)
写博客就和笔记一样真的很有用,你可以随时的翻阅.爬虫的爬虫原理与数据抓取.非结构化与结构化数据提取.动态HTML处理和简单的图像识别已经学完,就差整理博客了 开始学习scrapy了,所以重新建了个分类 ...
- 浅析Scrapy框架运行的基本流程
本篇博客将从Twisted的下载任务基本流程开始介绍,然后再一步步过渡到Scrapy框架的基本运行流程,其中还会需要我们自定义一个Low版的Scrapy框架.但内容不会涉及太多具体细节,而且需要注意的 ...
- UA池 代理IP池 scrapy的下载中间件
# 一些概念 - 在scrapy中如何给所有的请求对象尽可能多的设置不一样的请求载体身份标识 - UA池,process_request(request) - 在scrapy中如何给发生异常的请求设置 ...
- PHP实现下载功能之流程分析
客户端从服务端下载文件的流程分析: 浏览器发送一个请求,请求访问服务器中的某个网页(如:down.php),该网页的代码如下. 服务器接受到该请求以后,马上运行该down.php文件 运行该文件的时候 ...
- Django:学习笔记(4)——请求与响应
Django:学习笔记(4)——请求与响应 0.URL路由基础 Web应用中,用户通过不同URL链接访问我们提供的服务,其中首先经过的是一个URL调度器,它类似于SpringBoot中的前端控制器. ...
- Scrapy——6 APP抓包—scrapy框架下载图片
Scrapy——6 怎样进行APP抓包 scrapy框架抓取APP豆果美食数据 怎样用scrapy框架下载图片 怎样用scrapy框架去下载斗鱼APP的图片? Scrapy创建下载图片常见那些问题 怎 ...
- Scrapy——5 下载中间件常用函数、scrapy怎么对接selenium、常用的Setting内置设置有哪些
Scrapy——5 下载中间件常用的函数 Scrapy怎样对接selenium 常用的setting内置设置 对接selenium实战 (Downloader Middleware)下载中间件常用函数 ...
- 【技术博客】Flutter—使用网络请求的页面搭建流程、State生命周期、一些组件的应用
Flutter-使用网络请求的页面搭建流程.State生命周期.一些组件的应用 使用网络请求的页面搭建流程 在开发APP时,我们常常会遇到如下场景:进入一个页面后,要先进行网络调用,然后使用调用返 ...
- [ASP.NET MVC] ASP.NET Identity学习笔记 - 原始码下载、ID型别差异
[ASP.NET MVC] ASP.NET Identity学习笔记 - 原始码下载.ID型别差异 原始码下载 ASP.NET Identity是微软所贡献的开源项目,用来提供ASP.NET的验证.授 ...
随机推荐
- 微软RPC技术学习小结
RPC,即Remote Procedure Call,远程过程调用,是进程间通信(IPC, Inter Process Communication)技术的一种.由于这项技术在自己所在项目(Window ...
- 带你了解强大的Cadence家族,你可能只用到了它1/10的工具
[转载自 SI-list[中国]http://mp.weixin.qq.com/s/qsdfzQwIVjvwHXuCdvrPXA ] 本篇对2017年初版Cadence的全套所有EDA工具的技术特性特 ...
- sharepoint2010的几个类型字段赋值和取值的方法
1.日期类型查询,需要转换,方法如下: //转换时间 string startdate = SPUtility.CreateISO8601DateTimeFromSystemDateTime(Date ...
- 安装纯净 ubuntu linux (非虚拟机)
//--------------- Chinese version --------------------------------------------------// 前提条件:有另一台电脑(w ...
- 12/13 exercise
gcc -[cog] gcc pro1.o pro2.o //create a executable file x.out if unnamed
- SAP成都研究院飞机哥: SAP C4C中国本地化之微信聊天机器人的集成
今天的文章仍然来自Jerry的老同事,SAP成都研究院的张航(Zhang Harry).关于他的背景介绍,请参考张航之前的文章:SAP成都研究院飞机哥:程序猿和飞机的不解之缘.下面是他的正文. 大家好 ...
- CFG的定义
最近在CMU上NLP,好吧 对于见了很多年的CFG(Context-Free Grammar)发现又搞不懂是什么了 教材上写的是: mathematical system for modeling c ...
- R cannot be resolved的几种可能 R not generated
项目又爆红了,Eclipse真是够操心,顺便看一下R cannot be resolved的几种可能 这次是SVN合并的问题 2015-12-24 主要看 Console 输出的问题位置即可,一般都是 ...
- 共变导数(Covariant Derivative)
原文链接 导数是指某一点的导数表示了某点上指定函数的变化率. 比如,要确定某物体的速度在某时刻的加速度,就取时间轴上下一时刻的一个微小增量,然后考察速度的增量和时间增量的比值.如果这个比值比较大,说明 ...
- Android笔记(预安装APK)
一般一个安卓的产品在出厂时,会预安装许多APK,关于这些APP,主要分为下面这几类 1.系统级别APK 这一类应用一般是:电话/设置或者厂家自己特定的应用. 2.系统预安装APK 因为商业原因,产品出 ...