Scrapy 源代码分析系列-4 scrapy.commands 子包

子包scrapy.commands定义了在命令scrapy中使用的子命令(subcommand): bench, check, crawl, deploy, edit, fetch,

genspider, list, parse, runspider, settings, shell, startproject, version, view。 所有的子命令模块都定义了一个继承自

类ScrapyCommand的子类Command。

首先来看一下子命令crawl, 该子命令用来启动spider。

1. crawl.py

关注的重点在方法run(self, args, opts):

 def run(self, args, opts):
if len(args) < 1:
raise UsageError()
elif len(args) > 1:
raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
spname = args[0] crawler = self.crawler_process.create_crawler() # A
spider = crawler.spiders.create(spname, **opts.spargs) # B
crawler.crawl(spider) # C
self.crawler_process.start() # D

那么问题来啦,run接口方法是从哪里调用的呢? 让我们回到 Python.Scrapy.11-scrapy-source-code-analysis-part-1

中 "1.2 cmdline.py command.py" 关于"_run_print_help() "的说明。

A: 创建类Crawler对象crawler。在创建Crawler对象时, 同时将创建Crawler对象的实例属性spiders(SpiderManager)。如下所示:

 class Crawler(object):

     def __init__(self, settings):
self.configured = False
self.settings = settings
self.signals = SignalManager(self)
self.stats = load_object(settings['STATS_CLASS'])(self)
self._start_requests = lambda: ()
self._spider = None
# TODO: move SpiderManager to CrawlerProcess
spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
self.spiders = spman_cls.from_crawler(self) # spiders 的类型是: SpiderManager

Crawler对象对应一个SpiderManager对象,而SpiderManager对象管理多个Spider。

B: 获取Sipder对象。

C: 为Spider对象安装Crawler对象。(为蜘蛛安装爬行器)

D: 类CrawlerProcess的start()方法如下:

     def start(self):
if self.start_crawling():
self.start_reactor() def start_crawling(self):
log.scrapy_info(self.settings)
return self._start_crawler() is not None def start_reactor(self):
if self.settings.getbool('DNSCACHE_ENABLED'):
reactor.installResolver(CachingThreadedResolver(reactor))
reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
reactor.run(installSignalHandlers=False) # blocking call def _start_crawler(self):
if not self.crawlers or self.stopping:
return name, crawler = self.crawlers.popitem()
self._active_crawler = crawler
sflo = log.start_from_crawler(crawler)
crawler.configure()
crawler.install()
crawler.signals.connect(crawler.uninstall, signals.engine_stopped)
if sflo:
crawler.signals.connect(sflo.stop, signals.engine_stopped)
crawler.signals.connect(self._check_done, signals.engine_stopped)
crawler.start() # 调用类Crawler的start()方法
return name, crawler

类Crawler的start()方法如下:

     def start(self):
yield defer.maybeDeferred(self.configure)
if self._spider:
yield self.engine.open_spider(self._spider, self._start_requests()) # 和Engine建立了联系 (ExecutionEngine)
yield defer.maybeDeferred(self.engine.start)

关于类ExecutionEngine将在子包scrapy.core分析涉及。

2. startproject.py

3. subcommand是如何加载的

在cmdline.py的方法execute()中有如下几行代码:

     inproject = inside_project()
cmds = _get_commands_dict(settings, inproject)
cmdname = _pop_command_name(argv)

_get_commands_dict():

 def _get_commands_dict(settings, inproject):
cmds = _get_commands_from_module('scrapy.commands', inproject)
cmds.update(_get_commands_from_entry_points(inproject))
cmds_module = settings['COMMANDS_MODULE']
if cmds_module:
cmds.update(_get_commands_from_module(cmds_module, inproject))
return cmds

_get_commands_from_module():

 def _get_commands_from_module(module, inproject):
d = {}
for cmd in _iter_command_classes(module):
if inproject or not cmd.requires_project:
cmdname = cmd.__module__.split('.')[-1]
d[cmdname] = cmd()
return d

To Be Continued

接下来解析settings相关的逻辑。Python.Scrapy.15-scrapy-source-code-analysis-part-5

Python.Scrapy.14-scrapy-source-code-analysis-part-4的更多相关文章

  1. Memcached source code analysis (threading model)--reference

    Look under the start memcahced threading process memcached multi-threaded mainly by instantiating mu ...

  2. Golang Template source code analysis(Parse)

    This blog was written at go 1.3.1 version. We know that we use template thought by followed way: fun ...

  3. Memcached source code analysis -- Analysis of change of state--reference

    This article mainly introduces the process of Memcached, libevent structure of the main thread and w ...

  4. Apache Commons Pool2 源码分析 | Apache Commons Pool2 Source Code Analysis

    Apache Commons Pool实现了对象池的功能.定义了对象的生成.销毁.激活.钝化等操作及其状态转换,并提供几个默认的对象池实现.在讲述其实现原理前,先提一下其中有几个重要的对象: Pool ...

  5. Redis source code analysis

    http://zhangtielei.com/posts/blog-redis-dict.html http://zhangtielei.com/assets/photos_redis/redis_d ...

  6. linux kernel & source code analysis& hacking

    https://kernelnewbies.org/ http://www.tldp.org/LDP/lki/index.html https://kernelnewbies.org/ML https ...

  7. 2018.6.21 HOLTEK HT49R70A-1 Source Code analysis

    Cange note: “Reading TMR1H will latch the contents of TMR1H and TMR1L counter to the destination”? F ...

  8. The Ultimate List of Open Source Static Code Analysis Security Tools

    https://www.checkmarx.com/2014/11/13/the-ultimate-list-of-open-source-static-code-analysis-security- ...

  9. Top 40 Static Code Analysis Tools

    https://www.softwaretestinghelp.com/tools/top-40-static-code-analysis-tools/ In this article, I have ...

  10. Source Code Reading for Vue 3: How does `hasChanged` work?

    Hey, guys! The next generation of Vue has released already. There are not only the brand new composi ...

随机推荐

  1. LeetCode Potential Thought Pitfalls

    Problem Reason Reference Moving ZeroesSort Colors Corner cases   Shortest Word Distance Thought: 2 p ...

  2. 利用硬链接和truncate降低drop table对线上环境的影响

    众所周知drop table会严重的消耗服务器IO性能,如果被drop的table容量较大,甚至会影响到线上的正常. 首先,我们看一下为什么drop容量大的table会影响线上服务 直接执行drop ...

  3. Eclipse启动Tomcat后无法访问项目

    Eclipse中的Tomcat可以正常启动,不过发布项目之后,无法访问,包括http://localhost:8080/的小猫页面也无法访问到,报404错误.这是因为Eclipse所指定的Server ...

  4. java readLine()

    原文 虽然写IO方面的程序不多,但BufferedReader/BufferedInputStream倒是用过好几次的,原因是: 它有一个很特别的方法:readLine(),使用起来特别方便,每次读回 ...

  5. (整理)SQLServer_DBA 工具

    本文是SQLserver DBA 相关工具使用资料链接整理. SQLServer DBA 十大工具:http://www.cnblogs.com/fygh/archive/2012/04/25/246 ...

  6. 【VB.NET】文本框快捷键支持

    我们知道VB.NET中的文本框是不支持Ctrl+A的快捷键的. 如果让它支持呢? Private Sub txtSQL_KeyDown(ByVal sender As Object, ByVal e ...

  7. (转载)关于web端功能测试的测试方向

    一.功能测试 1.1链接测试 链接是web应用系统的一个很重要的特征,主要是用于页面之间切换跳转,指导用户去一些不知道地址的页面的主要手段,链接测试一般关注三点: 1)链接是否按照既定指示那样,确实链 ...

  8. 精妙SQL语句

    asc 按升序排列desc 按降序排列 下列语句部分是Mssql语句,不可以在access中使用.SQL分类: DDL—数据定义语言(CREATE,ALTER,DROP,DECLARE) DML—数据 ...

  9. skiplist

    §1 Skip List 介绍 Skip List是一种随机化的数据结构,基于并联的链表,其效率可比拟于二叉查找树(对于大多数操作需要O(log n)平均时间).基本上, 跳跃列表是对有序的链表增加上 ...

  10. python代码中指定时区获取时间方法

    os.environ['TZ'] = 'Asia/Shanghai' os.environ['TZ'] = 'Europe/London' hour_cur = time.strftime('%H')