Python.Scrapy.14-scrapy-source-code-analysis-part-4
Scrapy 源代码分析系列-4 scrapy.commands 子包
子包scrapy.commands定义了在命令scrapy中使用的子命令(subcommand): bench, check, crawl, deploy, edit, fetch,
genspider, list, parse, runspider, settings, shell, startproject, version, view。 所有的子命令模块都定义了一个继承自
类ScrapyCommand的子类Command。
首先来看一下子命令crawl, 该子命令用来启动spider。
1. crawl.py
关注的重点在方法run(self, args, opts):
def run(self, args, opts):
if len(args) < 1:
raise UsageError()
elif len(args) > 1:
raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
spname = args[0] crawler = self.crawler_process.create_crawler() # A
spider = crawler.spiders.create(spname, **opts.spargs) # B
crawler.crawl(spider) # C
self.crawler_process.start() # D
那么问题来啦,run接口方法是从哪里调用的呢? 让我们回到 Python.Scrapy.11-scrapy-source-code-analysis-part-1
中 "1.2 cmdline.py command.py" 关于"_run_print_help() "的说明。
A: 创建类Crawler对象crawler。在创建Crawler对象时, 同时将创建Crawler对象的实例属性spiders(SpiderManager)。如下所示:
class Crawler(object):
def __init__(self, settings):
self.configured = False
self.settings = settings
self.signals = SignalManager(self)
self.stats = load_object(settings['STATS_CLASS'])(self)
self._start_requests = lambda: ()
self._spider = None
# TODO: move SpiderManager to CrawlerProcess
spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
self.spiders = spman_cls.from_crawler(self) # spiders 的类型是: SpiderManager
Crawler对象对应一个SpiderManager对象,而SpiderManager对象管理多个Spider。
B: 获取Sipder对象。
C: 为Spider对象安装Crawler对象。(为蜘蛛安装爬行器)
D: 类CrawlerProcess的start()方法如下:
def start(self):
if self.start_crawling():
self.start_reactor() def start_crawling(self):
log.scrapy_info(self.settings)
return self._start_crawler() is not None def start_reactor(self):
if self.settings.getbool('DNSCACHE_ENABLED'):
reactor.installResolver(CachingThreadedResolver(reactor))
reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
reactor.run(installSignalHandlers=False) # blocking call def _start_crawler(self):
if not self.crawlers or self.stopping:
return name, crawler = self.crawlers.popitem()
self._active_crawler = crawler
sflo = log.start_from_crawler(crawler)
crawler.configure()
crawler.install()
crawler.signals.connect(crawler.uninstall, signals.engine_stopped)
if sflo:
crawler.signals.connect(sflo.stop, signals.engine_stopped)
crawler.signals.connect(self._check_done, signals.engine_stopped)
crawler.start() # 调用类Crawler的start()方法
return name, crawler
类Crawler的start()方法如下:
def start(self):
yield defer.maybeDeferred(self.configure)
if self._spider:
yield self.engine.open_spider(self._spider, self._start_requests()) # 和Engine建立了联系 (ExecutionEngine)
yield defer.maybeDeferred(self.engine.start)
关于类ExecutionEngine将在子包scrapy.core分析涉及。
2. startproject.py
3. subcommand是如何加载的
在cmdline.py的方法execute()中有如下几行代码:
inproject = inside_project()
cmds = _get_commands_dict(settings, inproject)
cmdname = _pop_command_name(argv)
_get_commands_dict():
def _get_commands_dict(settings, inproject):
cmds = _get_commands_from_module('scrapy.commands', inproject)
cmds.update(_get_commands_from_entry_points(inproject))
cmds_module = settings['COMMANDS_MODULE']
if cmds_module:
cmds.update(_get_commands_from_module(cmds_module, inproject))
return cmds
_get_commands_from_module():
def _get_commands_from_module(module, inproject):
d = {}
for cmd in _iter_command_classes(module):
if inproject or not cmd.requires_project:
cmdname = cmd.__module__.split('.')[-1]
d[cmdname] = cmd()
return d
To Be Continued
接下来解析settings相关的逻辑。Python.Scrapy.15-scrapy-source-code-analysis-part-5
Python.Scrapy.14-scrapy-source-code-analysis-part-4的更多相关文章
- Memcached source code analysis (threading model)--reference
Look under the start memcahced threading process memcached multi-threaded mainly by instantiating mu ...
- Golang Template source code analysis(Parse)
This blog was written at go 1.3.1 version. We know that we use template thought by followed way: fun ...
- Memcached source code analysis -- Analysis of change of state--reference
This article mainly introduces the process of Memcached, libevent structure of the main thread and w ...
- Apache Commons Pool2 源码分析 | Apache Commons Pool2 Source Code Analysis
Apache Commons Pool实现了对象池的功能.定义了对象的生成.销毁.激活.钝化等操作及其状态转换,并提供几个默认的对象池实现.在讲述其实现原理前,先提一下其中有几个重要的对象: Pool ...
- Redis source code analysis
http://zhangtielei.com/posts/blog-redis-dict.html http://zhangtielei.com/assets/photos_redis/redis_d ...
- linux kernel & source code analysis& hacking
https://kernelnewbies.org/ http://www.tldp.org/LDP/lki/index.html https://kernelnewbies.org/ML https ...
- 2018.6.21 HOLTEK HT49R70A-1 Source Code analysis
Cange note: “Reading TMR1H will latch the contents of TMR1H and TMR1L counter to the destination”? F ...
- The Ultimate List of Open Source Static Code Analysis Security Tools
https://www.checkmarx.com/2014/11/13/the-ultimate-list-of-open-source-static-code-analysis-security- ...
- Top 40 Static Code Analysis Tools
https://www.softwaretestinghelp.com/tools/top-40-static-code-analysis-tools/ In this article, I have ...
- Source Code Reading for Vue 3: How does `hasChanged` work?
Hey, guys! The next generation of Vue has released already. There are not only the brand new composi ...
随机推荐
- LeetCode Potential Thought Pitfalls
Problem Reason Reference Moving ZeroesSort Colors Corner cases Shortest Word Distance Thought: 2 p ...
- 利用硬链接和truncate降低drop table对线上环境的影响
众所周知drop table会严重的消耗服务器IO性能,如果被drop的table容量较大,甚至会影响到线上的正常. 首先,我们看一下为什么drop容量大的table会影响线上服务 直接执行drop ...
- Eclipse启动Tomcat后无法访问项目
Eclipse中的Tomcat可以正常启动,不过发布项目之后,无法访问,包括http://localhost:8080/的小猫页面也无法访问到,报404错误.这是因为Eclipse所指定的Server ...
- java readLine()
原文 虽然写IO方面的程序不多,但BufferedReader/BufferedInputStream倒是用过好几次的,原因是: 它有一个很特别的方法:readLine(),使用起来特别方便,每次读回 ...
- (整理)SQLServer_DBA 工具
本文是SQLserver DBA 相关工具使用资料链接整理. SQLServer DBA 十大工具:http://www.cnblogs.com/fygh/archive/2012/04/25/246 ...
- 【VB.NET】文本框快捷键支持
我们知道VB.NET中的文本框是不支持Ctrl+A的快捷键的. 如果让它支持呢? Private Sub txtSQL_KeyDown(ByVal sender As Object, ByVal e ...
- (转载)关于web端功能测试的测试方向
一.功能测试 1.1链接测试 链接是web应用系统的一个很重要的特征,主要是用于页面之间切换跳转,指导用户去一些不知道地址的页面的主要手段,链接测试一般关注三点: 1)链接是否按照既定指示那样,确实链 ...
- 精妙SQL语句
asc 按升序排列desc 按降序排列 下列语句部分是Mssql语句,不可以在access中使用.SQL分类: DDL—数据定义语言(CREATE,ALTER,DROP,DECLARE) DML—数据 ...
- skiplist
§1 Skip List 介绍 Skip List是一种随机化的数据结构,基于并联的链表,其效率可比拟于二叉查找树(对于大多数操作需要O(log n)平均时间).基本上, 跳跃列表是对有序的链表增加上 ...
- python代码中指定时区获取时间方法
os.environ['TZ'] = 'Asia/Shanghai' os.environ['TZ'] = 'Europe/London' hour_cur = time.strftime('%H')