Notes: Scrapy request, download, and result-handling flow

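While using Scrapy I found I still did not have a clear picture of its processing logic during the download phase, so I am writing this document as a review.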

1.      Request, download, and result-handling flow

Where to start?

engine.py

@defer.inlineCallbacks
def open_spider(self, spider, start_requests=(), close_if_idle=True):
    assert self.has_capacity(), "No free spider slot when opening %r" % \
        spider.name
    logger.info("Spider opened", extra={'spider': spider})
    nextcall = CallLaterOnce(self._next_request, spider)
    scheduler = self.scheduler_cls.from_crawler(self.crawler)
    start_requests = yield self.scraper.spidermw.process_start_requests(start_requests, spider)
    slot = Slot(start_requests, close_if_idle, nextcall, scheduler)
    self.slot = slot
    self.spider = spider
    yield scheduler.open(spider)
    yield self.scraper.open_spider(spider)
    self.crawler.stats.open_spider(spider)
    yield self.signals.send_catch_log_deferred(signals.spider_opened, spider=spider)
    slot.nextcall.schedule()
    slot.heartbeat.start(5)

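Note the last two lines.

nextcall is a CallLaterOnce instance, a small helper class Scrapy defines itself to wrap reactor calls; here it effectively puts self._next_request onto the reactor's queue of pending calls.

slot.nextcall.schedule() asks the reactor to call self._next_request shortly. It does not call it directly, and further schedule() calls made while one is still pending have no additional effect.

slot.heartbeat.start(5) arranges for nextcall.schedule to be called every 5 seconds.

With that, the reactor is driving the crawl.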
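The "schedule once" behaviour described above can be illustrated with a small sketch. It does not use Twisted: FakeReactor and CallLaterOnceSketch are hypothetical stand-ins written for this note, and the delay argument is ignored, so treat it as a model of the idea rather than Scrapy's actual implementation.

```python
class FakeReactor:
    """Hypothetical stand-in for the Twisted reactor: it only queues calls."""

    def __init__(self):
        self.pending = []

    def callLater(self, delay, func):
        # The delay is ignored in this toy; calls run in FIFO order.
        self.pending.append(func)

    def run_pending(self):
        calls, self.pending = self.pending, []
        for func in calls:
            func()


class CallLaterOnceSketch:
    """Queue func on the reactor, but never more than once at a time."""

    def __init__(self, reactor, func, *args):
        self._reactor = reactor
        self._func = func
        self._args = args
        self._scheduled = False

    def schedule(self):
        if not self._scheduled:       # a pending call already covers this one
            self._scheduled = True
            self._reactor.callLater(0, self)

    def __call__(self):
        self._scheduled = False       # allow re-scheduling from inside func
        return self._func(*self._args)


calls = []
reactor = FakeReactor()
nextcall = CallLaterOnceSketch(reactor, calls.append, "next_request")

nextcall.schedule()
nextcall.schedule()    # no effect: one call is already pending
reactor.run_pending()  # the wrapped function runs exactly once
```

Because duplicate schedule() calls collapse into one, many callbacks can safely "poke" the control loop without it running several times per reactor iteration.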

Now look at self._next_request, the control loop:

def _next_request(self, spider):
    slot = self.slot
    if not slot:
        return
    if self.paused:
        return
    while not self._needs_backout(spider):
        if not self._next_request_from_scheduler(spider):
            break
    if slot.start_requests and not self._needs_backout(spider):
        try:
            request = next(slot.start_requests)
        except StopIteration:
            slot.start_requests = None
        except Exception:
            slot.start_requests = None
            logger.error('Error while obtaining start requests',
                         exc_info=True, extra={'spider': spider})
        else:
            self.crawl(request, spider)
    if self.spider_is_idle(spider) and slot.close_if_idle:
        self._spider_idle(spider)

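_needs_backout checks the crawler's state to decide whether the engine should stop pulling more requests for now (for example, because the engine is no longer running or the downloader is at capacity).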
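To make the shape of this loop easier to follow, here is a toy version with no Twisted and no I/O. ToyEngine, max_active, and the single capacity check are assumptions made for illustration; the real _needs_backout looks at more than capacity, and the sketch assumes crawl() simply puts the request into the scheduler.

```python
from collections import deque


class ToyEngine:
    """Hypothetical miniature of the _next_request control loop."""

    def __init__(self, start_requests, max_active=2):
        self.start_requests = iter(start_requests)
        self.scheduler = deque()      # stands in for slot.scheduler
        self.active = []              # requests handed to the "downloader"
        self.max_active = max_active

    def needs_backout(self):
        # Only a capacity check here; an assumption for this sketch.
        return len(self.active) >= self.max_active

    def next_request(self):
        # 1) Drain the scheduler while there is capacity.
        while not self.needs_backout():
            if not self.scheduler:
                break
            self.active.append(self.scheduler.popleft())
        # 2) Pull at most one start request per pass and enqueue it.
        if self.start_requests is not None and not self.needs_backout():
            try:
                request = next(self.start_requests)
            except StopIteration:
                self.start_requests = None
            else:
                self.scheduler.append(request)


engine = ToyEngine(["a", "b", "c"])
for _ in range(4):        # each pass models one firing of nextcall
    engine.next_request()
```

After these passes the toy engine holds two active requests and leaves the third start request unconsumed, since the capacity check makes the loop back off until something finishes.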

def _next_request_from_scheduler(self, spider):
    slot = self.slot
    request = slot.scheduler.next_request()
    if not request:
        return
    d = self._download(request, spider)
    d.addBoth(self._handle_downloader_output, request, spider)
    d.addErrback(lambda f: logger.info('Error while handling downloader output',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    d.addBoth(lambda _: slot.remove_request(request))
    d.addErrback(lambda f: logger.info('Error while removing request from slot',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    d.addBoth(lambda _: slot.nextcall.schedule())
    d.addErrback(lambda f: logger.info('Error while scheduling new request',
                                       exc_info=failure_to_exc_info(f),
                                       extra={'spider': spider}))
    return d

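d is a Deferred, and a chain of callbacks is attached to it: handling the downloader output, removing the request from the slot, and fetching the next request (by calling nextcall.schedule(), which leads to self._next_request()).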
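The alternating addBoth / addErrback pattern is easier to see in a toy model. ToyDeferred below is a hypothetical, synchronous simplification written for this note, not Twisted's Deferred: it only models how a result moves between the callback path and the error path from one step to the next.

```python
def _passthrough(value):
    return value


class ToyDeferred:
    """Hypothetical synchronous model of a callback chain (not Twisted)."""

    def __init__(self):
        self._pairs = []                  # one (callback, errback) per step

    def addCallbacks(self, callback, errback=None):
        self._pairs.append((callback or _passthrough, errback or _passthrough))
        return self

    def addBoth(self, func):
        return self.addCallbacks(func, func)

    def addErrback(self, errback):
        return self.addCallbacks(None, errback)

    def fire(self, result):
        for callback, errback in self._pairs:
            handler = errback if isinstance(result, Exception) else callback
            try:
                result = handler(result)
            except Exception as exc:      # a raising step switches to the error path
                result = exc
        return result


log = []


def handle_output(result):
    raise ValueError("handler broke")     # simulate a bug in the handler


d = ToyDeferred()
d.addBoth(handle_output)
d.addErrback(lambda err: log.append("logged: %s" % err))  # returns None: back to success path
d.addBoth(lambda _: log.append("removed from slot"))
d.addBoth(lambda _: log.append("scheduled next"))
final = d.fire("response")
```

In this model each addErrback only sees errors raised by the step just before it, and because the logging errback returns None the chain continues on the success path, so the cleanup steps still run.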

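Slot is the class that holds the state of the requests currently in progress.

From here the flow splits into two paths: the download itself, _download (covered in its own section), and the handling of the result the download returns.

Handling the download result: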

def _handle_downloader_output(self, response, request, spider):
    assert isinstance(response, (Request, Response, Failure)), response
    # downloader middleware can return requests (for example, redirects)
    if isinstance(response, Request):
        self.crawl(response, spider)
        return
    # response is a Response or Failure
    d = self.scraper.enqueue_scrape(response, request, spider)
    d.addErrback(lambda f: logger.error('Error while enqueuing downloader output',
                                        exc_info=failure_to_exc_info(f),
                                        extra={'spider': spider}))
    return d

Next, follow d = self.scraper.enqueue_scrape(response, request, spider):

def enqueue_scrape(self, response, request, spider):
    slot = self.slot
    dfd = slot.add_response_request(response, request)
    def finish_scraping(_):
        slot.finish_response(response, request)  # update slot state
        self._check_if_closing(spider, slot)
        self._scrape_next(spider, slot)  # next step
        return _
    dfd.addBoth(finish_scraping)
    dfd.addErrback(
        lambda f: logger.error('Scraper bug processing %(request)s',
                               {'request': request},
                               exc_info=failure_to_exc_info(f),
                               extra={'spider': spider}))
    self._scrape_next(spider, slot)  # next step
    return dfd

As before, slot holds the scraper's current task queue and state.

Continuing:

def _scrape_next(self, spider, slot):
    while slot.queue:
        response, request, deferred = slot.next_response_request_deferred()
        self._scrape(response, request, spider).chainDeferred(deferred)

def _scrape(self, response, request, spider):
    """Handle the downloaded response or failure through the spider
    callback/errback"""
    assert isinstance(response, (Response, Failure))
    dfd = self._scrape2(response, request, spider) # returns spiders processed output
    dfd.addErrback(self.handle_spider_error, request, response, spider)
    dfd.addCallback(self.handle_spider_output, request, response, spider)
    return dfd

This attaches the callback dfd.addCallback(self.handle_spider_output, request, response, spider).

Note: dfd = self._scrape2(response, request, spider) is where the spider's parsing callback gets attached, through call_spider:

def call_spider(self, result, request, spider):
    result.request = request
    dfd = defer_result(result)
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
    return dfd.addCallback(iterate_spider_output)

The key line: dfd.addCallbacks(request.callback or spider.parse, request.errback)
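The `request.callback or spider.parse` expression is just a fallback: a request without its own callback is handled by the spider's parse method. A minimal sketch of that choice, using hypothetical ToyRequest and ToySpider classes in place of Scrapy's own:

```python
class ToyRequest:
    """Hypothetical stand-in for a request; callback is optional."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class ToySpider:
    def parse(self, response):
        return ["parse:" + response]

    def parse_detail(self, response):
        return ["detail:" + response]


def pick_callback(request, spider):
    # Same expression as in call_spider: fall back to spider.parse.
    return request.callback or spider.parse


spider = ToySpider()
default_cb = pick_callback(ToyRequest("http://example.com/"), spider)
custom_cb = pick_callback(
    ToyRequest("http://example.com/1", callback=spider.parse_detail), spider
)
```

Here default_cb is the spider's parse method and custom_cb is parse_detail, which mirrors why a spider only needs to define parse when its requests carry no explicit callback.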

Next, look at self.handle_spider_output:

def handle_spider_output(self, result, request, response, spider):
    if not result:
        return defer_succeed(None)
    it = iter_errback(result, self.handle_spider_error, request, response, spider)
    dfd = parallel(it, self.concurrent_items,
                   self._process_spidermw_output, request, response, spider)
    return dfd

def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, (BaseItem, dict)):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    elif output is None:
        pass
    else:
        typename = type(output).__name__
        logger.error('Spider must return Request, BaseItem, dict or None, '
                     'got %(typename)r in %(request)s',
                     {'request': request, 'typename': typename},
                     extra={'spider': spider})

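If the output is a Request, it is passed to crawl().

If it is an item or a dict, it is passed to self.itemproc.process_item(output, spider), which is where the process_item methods written in the pipelines get called.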
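The branching on output type can be sketched without Scrapy. In this illustration ToyRequest, route_output, and the two callables passed in are hypothetical names, and only dict is treated as an item (the real code also accepts BaseItem).

```python
class ToyRequest:
    """Hypothetical stand-in for a request yielded by a spider callback."""

    def __init__(self, url):
        self.url = url


def route_output(output, crawl, process_item):
    """Send one spider output to the right place, as described above."""
    if isinstance(output, ToyRequest):
        crawl(output)                 # back to the engine / scheduler
        return "request"
    if isinstance(output, dict):
        process_item(output)          # on to the item pipelines
        return "item"
    if output is None:
        return "ignored"
    return "error: got %s" % type(output).__name__


crawled, items = [], []
outputs = [ToyRequest("http://example.com/next"), {"title": "x"}, None, 42]
kinds = [route_output(o, crawled.append, items.append) for o in outputs]
```

A callback that yields a mix of follow-up requests and items therefore feeds both the scheduler and the pipelines from a single generator.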

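1.1.    Summary

The way the code splits work across small functions makes the call chain a little complicated, but the benefit is low coupling between the pieces, so an individual function or component is easy to replace.

2.      Downloader

Walking through the download flow

Starting in engine.py: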

def _download(self, request, spider):
    slot = self.slot
    # update slot state
    slot.add_request(request)
    def _on_success(response):
        assert isinstance(response, (Response, Request))
        if isinstance(response, Response):
            response.request = request # tie request to response received
            logkws = self.logformatter.crawled(request, response, spider)
            logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
            self.signals.send_catch_log(signal=signals.response_received, \
                response=response, request=request, spider=spider)
        return response
    def _on_complete(_):
        slot.nextcall.schedule()
        return _
    dwld = self.downloader.fetch(request, spider)
    dwld.addCallbacks(_on_success)
    dwld.addBoth(_on_complete)
    return dwld

This mainly obtains a Deferred and attaches callbacks to it for state updates and signal dispatch.

dwld = self.downloader.fetch(request, spider)

Step into downloader.fetch():

def fetch(self, request, spider):
    def _deactivate(response):
        self.active.remove(request)
        return response
    self.active.add(request)
    dfd = self.middleware.download(self._enqueue_request, request, spider)
    return dfd.addBoth(_deactivate)

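active holds the requests currently being downloaded.

_deactivate is attached so that the corresponding entry is removed from active once the download finishes.

Next, middleware.download():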

def download(self, download_func, request, spider):
    @defer.inlineCallbacks
    def process_request(request):
        for method in self.methods['process_request']:
            response = yield method(request=request, spider=spider)
            assert response is None or isinstance(response, (Response, Request)), \
                'Middleware %s.process_request must return None, Response or Request, got %s' % \
                (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
            if response:
                defer.returnValue(response)
        defer.returnValue((yield download_func(request=request,spider=spider)))

    @defer.inlineCallbacks
    def process_response(response):
        assert response is not None, 'Received None in process_response'
        if isinstance(response, Request):
            defer.returnValue(response)
        for method in self.methods['process_response']:
            response = yield method(request=request, response=response,
                                    spider=spider)
            assert isinstance(response, (Response, Request)), \
                'Middleware %s.process_response must return Response or Request, got %s' % \
                (six.get_method_self(method).__class__.__name__, type(response))
            if isinstance(response, Request):
                defer.returnValue(response)
        defer.returnValue(response)

    @defer.inlineCallbacks
    def process_exception(_failure):
        exception = _failure.value
        for method in self.methods['process_exception']:
            response = yield method(request=request, exception=exception,
                                    spider=spider)
            assert response is None or isinstance(response, (Response, Request)), \
                'Middleware %s.process_exception must return None, Response or Request, got %s' % \
                (six.get_method_self(method).__class__.__name__, type(response))
            if response:
                defer.returnValue(response)
        defer.returnValue(_failure)

    deferred = mustbe_deferred(process_request, request)
    deferred.addErrback(process_exception)
    deferred.addCallback(process_response)
    return deferred

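This is where the downloader middlewares are run.

If no process_request method returns a result, it then calls defer.returnValue((yield download_func(request=request,spider=spider)))

download_func here is actually the downloader's self._enqueue_request.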
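The short-circuit behaviour of process_request is the part that is easiest to miss. The sketch below is a synchronous simplification with plain strings for requests and responses; the function and middleware names are hypothetical, and it leaves out process_exception and the case where a middleware returns a new Request.

```python
def download(methods, download_func, request):
    """Synchronous sketch of the process_request / process_response steps."""
    response = None
    for method in methods["process_request"]:
        response = method(request)
        if response is not None:      # short-circuit: skip the real download
            break
    if response is None:
        response = download_func(request)
    for method in methods["process_response"]:
        response = method(request, response)
    return response


fetched = []


def fake_download(request):
    fetched.append(request)           # record that a "real" download happened
    return "body of " + request


def cache_process_request(request):
    # Hypothetical cache middleware: answers one URL without downloading.
    return "cached body" if request == "/cached" else None


def tag_process_response(request, response):
    return response + " [seen]"


methods = {
    "process_request": [cache_process_request],
    "process_response": [tag_process_response],
}
hit = download(methods, fake_download, "/cached")
miss = download(methods, fake_download, "/fresh")
```

Only "/fresh" reaches fake_download, while both results still pass through the process_response step, which is the same shape as the Deferred-based code above.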

Moving on:

def _enqueue_request(self, request, spider):
    key, slot = self._get_slot(request, spider)
    request.meta['download_slot'] = key
    def _deactivate(response):
        slot.active.remove(request)
        return response
    slot.active.add(request)
    self.signals.send_catch_log(signal=signals.request_reached_downloader,
                                request=request,
                                spider=spider)
    deferred = defer.Deferred().addBoth(_deactivate)
    slot.queue.append((request, deferred))
    self._process_queue(spider, slot)
    return deferred

As usual, slot holds the task state.

It then calls self._process_queue.

The download delay is applied here.

def _process_queue(self, spider, slot):
    if slot.latercall and slot.latercall.active():
        return
    # Delay queue processing if a download_delay is configured
    now = time()
    delay = slot.download_delay()
    if delay:
        penalty = delay - now + slot.lastseen
        if penalty > 0:
            slot.latercall = reactor.callLater(penalty, self._process_queue, spider, slot)
            return
    # Process enqueued requests if there are free slots to transfer for this slot
    while slot.queue and slot.free_transfer_slots() > 0:
        slot.lastseen = now
        request, deferred = slot.queue.popleft()
        dfd = self._download(slot, request, spider)
        dfd.chainDeferred(deferred)
        # prevent burst if inter-request delays were configured
        if delay:
            self._process_queue(spider, slot)
            break
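The delay check in _process_queue comes down to one line of arithmetic: penalty = delay - now + slot.lastseen, which is the time still left before `delay` seconds have passed since the last request started. A small worked sketch, where download_penalty is a hypothetical helper name and the timestamps are made-up values:

```python
def download_penalty(delay, now, lastseen):
    """Seconds still to wait before the next request may start (0.0 if none)."""
    if not delay:
        return 0.0
    return max(0.0, delay - now + lastseen)


# Last request started at t=100.0 with a 2 second delay configured.
too_early = download_penalty(delay=2.0, now=100.5, lastseen=100.0)   # 1.5
late_enough = download_penalty(delay=2.0, now=103.0, lastseen=100.0) # 0.0
no_delay = download_penalty(delay=0, now=100.5, lastseen=100.0)      # 0.0
```

When the penalty is positive, the real code does not sleep; it asks the reactor to call _process_queue again after that many seconds and returns immediately.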

The download call itself, _download():

def _download(self, slot, request, spider):
    # The order is very important for the following deferreds. Do not change!
    # 1. Create the download deferred
    dfd = mustbe_deferred(self.handlers.download_request, request, spider)
    # 2. Notify response_downloaded listeners about the recent download
    # before querying queue for next request
    def _downloaded(response):
        self.signals.send_catch_log(signal=signals.response_downloaded,
                                    response=response,
                                    request=request,
                                    spider=spider)
        return response
    dfd.addCallback(_downloaded)
    # 3. After response arrives,  remove the request from transferring
    # state to free up the transferring slot so it can be used by the
    # following requests (perhaps those which came from the downloader
    # middleware itself)
    slot.transferring.add(request)
    def finish_transferring(_):
        slot.transferring.remove(request)
        self._process_queue(spider, slot)
        return _
    return dfd.addBoth(finish_transferring)

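Here too a download queue is maintained, which is how the configured download delay is honored. The download request itself is issued by the call to self.handlers.download_request.

What lies beyond that gets fairly deep and touches on a lot of topics, so I will write it up when I get the chance.

Each download handler can be thought of as something like the requests package: a URL goes in and a response comes out.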
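As a rough illustration of "a URL goes in, a response comes out", the sketch below dispatches to a handler based on the URL scheme. Everything in it is hypothetical (the HANDLERS table, the fake handler functions, and the string "responses"), and selecting by scheme is an assumption of this sketch; it does no network I/O.

```python
from urllib.parse import urlparse


def fake_http_handler(url):
    # Hypothetical handler: a real one would perform the HTTP request.
    return "HTTP response for " + url


def fake_file_handler(url):
    return "file contents for " + url


# Assumed for this sketch: one handler per URL scheme.
HANDLERS = {
    "http": fake_http_handler,
    "https": fake_http_handler,
    "file": fake_file_handler,
}


def download_request(url):
    scheme = urlparse(url).scheme
    handler = HANDLERS.get(scheme)
    if handler is None:
        raise ValueError("no handler for scheme %r" % scheme)
    return handler(url)


page = download_request("https://example.com/index.html")
local = download_request("file:///tmp/data.txt")
```

Everything above the handler (slots, delays, middlewares) only decides when and whether this call happens; the handler is the only piece that knows how to fetch.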
