Scrapy Source Code Analysis Series - 4: The scrapy.commands Subpackage

The subpackage scrapy.commands defines the subcommands used by the scrapy command: bench, check, crawl, deploy, edit, fetch, genspider, list, parse, runspider, settings, shell, startproject, version, view. Every subcommand module defines a class named Command that inherits from ScrapyCommand.
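For illustration, a minimal subcommand following this convention could look like the sketch below. The module path and the printed message are hypothetical; only the ScrapyCommand base class and its requires_project / syntax() / short_desc() / run() hooks come from Scrapy:

    # myproject/commands/hello.py -- hypothetical module, for illustration only
    from scrapy.commands import ScrapyCommand

    class Command(ScrapyCommand):
        requires_project = False

        def syntax(self):
            return "[options]"

        def short_desc(self):
            return "Print a greeting (illustrative only)"

        def run(self, args, opts):
            print("hello from a custom subcommand")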

Let's look first at the crawl subcommand, which is used to start a spider.

1. crawl.py

The method to focus on is run(self, args, opts):

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]
        crawler = self.crawler_process.create_crawler()          # A
        spider = crawler.spiders.create(spname, **opts.spargs)   # B
        crawler.crawl(spider)                                     # C
        self.crawler_process.start()                              # D
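As a usage note (assuming a project that defines a spider named myspider), the positional argument becomes spname, and any -a NAME=VALUE options are collected into opts.spargs and forwarded to the spider constructor in step B:

    scrapy crawl myspider -a category=books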

So where is this run() method called from? Refer back to the discussion of _run_print_help() in section "1.2 cmdline.py command.py" of Python.Scrapy.11-scrapy-source-code-analysis-part-1.

A: Creates a Crawler object named crawler. While the Crawler object is being constructed, its instance attribute spiders (a SpiderManager) is created as well, as shown below:

    class Crawler(object):

        def __init__(self, settings):
            self.configured = False
            self.settings = settings
            self.signals = SignalManager(self)
            self.stats = load_object(settings['STATS_CLASS'])(self)
            self._start_requests = lambda: ()
            self._spider = None
            # TODO: move SpiderManager to CrawlerProcess
            spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
            self.spiders = spman_cls.from_crawler(self)  # spiders is a SpiderManager instance

Each Crawler object holds one SpiderManager object, and the SpiderManager manages multiple Spiders.
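Conceptually, the SpiderManager maps spider names to spider classes, so crawler.spiders.create(spname, **opts.spargs) in step B amounts to a lookup by name followed by instantiation. A simplified sketch (attribute names abbreviated; not the exact library code):

    class SpiderManager(object):
        def __init__(self):
            self._spiders = {}   # name -> spider class, populated from SPIDER_MODULES

        def create(self, spider_name, **spider_kwargs):
            try:
                spcls = self._spiders[spider_name]
            except KeyError:
                raise KeyError("Spider not found: %s" % spider_name)
            return spcls(**spider_kwargs)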

B: Obtains the Spider object by name.

C: Installs the Crawler object on the Spider object (i.e., binds the spider to the crawler that will run it).
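The body of crawl() is short; a simplified sketch of what this step does (based on the same Scrapy generation as the code above, lightly trimmed):

    def crawl(self, spider, requests=None):
        spider.set_crawler(self)           # install the crawler on the spider
        self._spider = spider              # remember which spider to run
        if requests is None:
            self._start_requests = spider.start_requests
        else:
            self._start_requests = lambda: requests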

D: The start() method of the CrawlerProcess class is shown below:

    def start(self):
        if self.start_crawling():
            self.start_reactor()

    def start_crawling(self):
        log.scrapy_info(self.settings)
        return self._start_crawler() is not None

    def start_reactor(self):
        if self.settings.getbool('DNSCACHE_ENABLED'):
            reactor.installResolver(CachingThreadedResolver(reactor))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        reactor.run(installSignalHandlers=False)  # blocking call

    def _start_crawler(self):
        if not self.crawlers or self.stopping:
            return
        name, crawler = self.crawlers.popitem()
        self._active_crawler = crawler
        sflo = log.start_from_crawler(crawler)
        crawler.configure()
        crawler.install()
        crawler.signals.connect(crawler.uninstall, signals.engine_stopped)
        if sflo:
            crawler.signals.connect(sflo.stop, signals.engine_stopped)
        crawler.signals.connect(self._check_done, signals.engine_stopped)
        crawler.start()  # calls the start() method of the Crawler class
        return name, crawler

The start() method of the Crawler class is shown below:

    @defer.inlineCallbacks
    def start(self):
        yield defer.maybeDeferred(self.configure)
        if self._spider:
            yield self.engine.open_spider(self._spider, self._start_requests())  # this is where it connects to the ExecutionEngine
        yield defer.maybeDeferred(self.engine.start)

The ExecutionEngine class will be covered in the analysis of the scrapy.core subpackage.

2. startproject.py

3. How subcommands are loaded

The execute() method in cmdline.py contains the following lines:

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)

_get_commands_dict():

    def _get_commands_dict(settings, inproject):
        cmds = _get_commands_from_module('scrapy.commands', inproject)
        cmds.update(_get_commands_from_entry_points(inproject))
        cmds_module = settings['COMMANDS_MODULE']
        if cmds_module:
            cmds.update(_get_commands_from_module(cmds_module, inproject))
        return cmds
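Commands are gathered from three places: the built-in scrapy.commands package, setuptools entry points, and an optional project module named by the COMMANDS_MODULE setting. This means a project can ship its own subcommands, for example (the module path myproject.commands is hypothetical):

    # settings.py
    COMMANDS_MODULE = 'myproject.commands'   # hypothetical package containing extra Command classes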

_get_commands_from_module():

    def _get_commands_from_module(module, inproject):
        d = {}
        for cmd in _iter_command_classes(module):
            if inproject or not cmd.requires_project:
                cmdname = cmd.__module__.split('.')[-1]
                d[cmdname] = cmd()
        return d
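The helper _iter_command_classes walks the given module and its submodules and yields every ScrapyCommand subclass defined there; the command name is simply the module file name, which is why crawl.py provides scrapy crawl. Roughly (a simplified sketch, not a verbatim copy of cmdline.py):

    import inspect
    from scrapy.commands import ScrapyCommand
    from scrapy.utils.misc import walk_modules

    def _iter_command_classes(module_name):
        for module in walk_modules(module_name):
            for obj in vars(module).values():
                if (inspect.isclass(obj)
                        and issubclass(obj, ScrapyCommand)
                        and obj.__module__ == module.__name__):
                    yield obj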

To Be Continued

Next, part 5 will analyze the settings-related logic: Python.Scrapy.15-scrapy-source-code-analysis-part-5.
