Scrapy Framework Fundamentals: Twisted

Internally, Scrapy achieves crawler concurrency through an event-loop mechanism.
Before:

  import requests

  url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']

  for item in url_list:
      response = requests.get(item)
      print(response.text)

The old way: multiple request tasks executed one after another

Now:

  from twisted.web.client import getPage, defer
  from twisted.internet import reactor

  # Part 1: the "agent" starts taking on tasks
  def callback(contents):
      print(contents)

  deferred_list = []
  url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
  for url in url_list:
      deferred = getPage(bytes(url, encoding='utf8'))
      deferred.addCallback(callback)
      deferred_list.append(deferred)

  # Part 2: once the agent has finished every task, stop
  dlist = defer.DeferredList(deferred_list)

  def all_done(arg):
      reactor.stop()

  dlist.addBoth(all_done)

  # Part 3: tell the agent to go and do the work
  reactor.run()

twisted

  1. What is Twisted?
  • Official definition: an asynchronous, non-blocking module built on an event loop.
  • In plain words: one thread can issue HTTP requests to many targets at the same time.
  2. Non-blocking: no waiting; all requests go out together. When I open connections to request A, request B, and request C, I do not wait for one connection to come back before starting the next; I send one and immediately send the next.
  import socket

  # each non-blocking connect() returns immediately; BlockingIOError only means "still connecting"
  sk1 = socket.socket()
  sk1.setblocking(False)
  try:
      sk1.connect(('1.1.1.1', 80))
  except BlockingIOError:
      pass

  sk2 = socket.socket()
  sk2.setblocking(False)
  try:
      sk2.connect(('1.1.1.2', 80))
  except BlockingIOError:
      pass

  sk3 = socket.socket()
  sk3.setblocking(False)
  try:
      sk3.connect(('1.1.1.3', 80))
  except BlockingIOError:
      pass

Non-blocking sockets

  3. Asynchronous: callbacks. Once I have found the A, B, or C that callback_A, callback_B, or callback_C asked for, I actively notify them.

  def callback(contents):
      print(contents)

callback

  4. Event loop: I keep looping over the three socket tasks (request A, request B, request C), checking each one's state: has the connection succeeded, and has the result come back? (A minimal sketch of this idea follows.)
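To make the event-loop idea concrete, here is a minimal sketch of my own (not how Twisted is implemented internally) that uses select to keep polling the three non-blocking sockets for the two states just described; the host addresses are placeholders.

  import select
  import socket

  sockets = []
  for host in ('1.1.1.1', '1.1.1.2', '1.1.1.3'):   # placeholder targets
      sk = socket.socket()
      sk.setblocking(False)
      try:
          sk.connect((host, 80))
      except BlockingIOError:
          pass                                      # "still connecting" is expected here
      sockets.append(sk)

  unsent = set(sockets)    # connected but not yet written to
  pending = set(sockets)   # still waiting for a response

  while pending:
      # writable -> the connection succeeded; readable -> a result has come back
      readable, writable, _ = select.select(list(pending), list(unsent), [], 1.0)
      for sk in writable:
          sk.send(b'GET / HTTP/1.1\r\nHost: example\r\n\r\n')
          unsent.discard(sk)
      for sk in readable:
          print(sk.recv(8192))                      # this is where a callback would fire
          sk.close()
          pending.discard(sk)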

    How does it differ from requests?
  requests is a Python module that sends HTTP requests while impersonating a browser
      - it wraps a socket to send the request

  twisted is an asynchronous, non-blocking network framework based on an event loop
      - it wraps a socket to send the request
      - it completes concurrent requests in a single thread
  PS: three keywords
      - non-blocking: no waiting
      - asynchronous: callbacks
      - event loop: keep looping to check states

The difference between twisted and requests

Scrapy

  Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a whole range of programs, such as data mining, information processing, and archiving historical data.
Although it was originally designed for page scraping (more precisely, web scraping), it can also be used to fetch the data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has broad uses: data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

Scrapy consists mainly of the following components:

  • Engine (Scrapy)

    Handles the data flow across the whole system and triggers events (the core of the framework).

  • Scheduler

  Accepts the requests sent over by the engine, pushes them into a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.

  • Downloader

  Downloads page content and hands it back to the spiders (the downloader is built on Twisted, an efficient asynchronous model).

  • Spiders

  The spiders do the real work: they extract the information you need from specific pages, i.e. the so-called items. You can also extract links from them so that Scrapy goes on to crawl the next page.

  • Item Pipeline

  Processes the items the spiders extract from pages; its main jobs are persisting items, validating them, and cleaning out unwanted data. After a page has been parsed by a spider, it is sent to the item pipeline and passes through several processing steps in a fixed order.

  • Downloader Middlewares

  A framework that sits between the Scrapy engine and the downloader; it mainly processes the requests and responses exchanged between the two.

  • Spider Middlewares

  A framework between the Scrapy engine and the spiders; its main job is to process the spiders' response input and request output.

  • Scheduler Middlewares

  Middleware between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.

Scrapy's run flow is roughly as follows:

  1. The engine finds the spider to run and calls its start_requests method, obtaining an iterator.
  2. Iterating over it yields Request objects; each Request wraps the URL to visit and a callback. All the Request objects (tasks) go to the scheduler, which puts them into a request queue and deduplicates them as it does so.
  3. The downloader asks the engine for download tasks (Request objects); the engine asks the scheduler, the scheduler pops a Request off the queue and returns it to the engine, and the engine hands it to the downloader.
  4. When the download finishes, the downloader returns a Response object to the engine, which runs the request's callback.
  5. Back in the spider's callback, the spider parses the Response.
  6. yield Item(): a parsed item is handed to the item pipeline for further processing.
  7. yield Request(): a parsed link (URL) goes back to the scheduler to wait its turn to be crawled. (A minimal spider illustrating this flow is sketched below.)
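To tie the seven steps to code, here is a minimal spider sketch; the spider name and the selectors are invented for illustration and do not come from a real project.

  import scrapy

  class NewsSpider(scrapy.Spider):
      name = 'news_demo'                                   # illustrative name
      start_urls = ['https://example.com/']                # step 1: start_requests is built from these

      def parse(self, response):                           # steps 4-5: the engine calls back with the Response
          for href in response.css('a.item::attr(href)').extract():
              # step 7: yield Request -> back to the scheduler, waiting to be crawled
              yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

      def parse_detail(self, response):
          # step 6: yield an item -> handed to the item pipeline
          yield {'url': response.url, 'title': response.css('title::text').extract_first()}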

一. Basic commands and project structure

 Basic commands

  1. Create a project: scrapy startproject <project_name>
     Creates a project directory under the current directory (similar to Django).
  2. Create a spider application:
     cd <project_name>
     scrapy genspider [-t template] <name> <domain>
     e.g. scrapy genspider -t basic oldboy oldboy.com
          scrapy genspider -t crawl weisuen sohu.com
     PS:
     List the available templates: scrapy genspider -l
     Show a template's contents:   scrapy genspider -d <template_name>
  3. List the spiders in the project: scrapy list
  4. Run a spider: scrapy crawl <spider_name>
     scrapy crawl quotes
     scrapy runspider quotes.py
     Pause and resume: scrapy crawl lagou -s JOBDIR=job_info/001
     Save the output to a file: scrapy crawl quotes -o quotes.json
     Test selectors in the shell: scrapy shell 'http://scrapy.org' --nolog

For the commands in detail, see: Scrapy's command-line tool explained

Project structure

  project_name/
      scrapy.cfg
      project_name/
          __init__.py
          items.py
          pipelines.py
          settings.py
          spiders/
              __init__.py
              spider1.py
              spider2.py
              spider3.py

What the files are for:

  • scrapy.cfg   The project's top-level configuration. (The settings that actually drive the crawl live in settings.py.)
  • items.py     Data templates for structured data, similar to Django's Model.
  • pipelines    Data-processing behaviour, e.g. persisting the structured data.
  • settings.py  Configuration, e.g. recursion depth, concurrency, download delay.
  • spiders      The spiders directory: create your spider files here and write the crawl rules.

Note: spider files are usually named after the target site's domain.

二. Writing spiders

Scrapy gives us five kinds of spider for building requests, parsing data, and returning items. The two you will use most are scrapy.Spider and scrapy.CrawlSpider. Below are the attributes and methods a spider commonly uses.

scrapy.Spider
Attribute / method   Purpose               Notes
name                 the spider's name     used when you start the spider
start_urls           the starting URLs     a list, consumed by the default start_requests
allowed_domains      a simple URL filter   if a requested URL is not matched by allowed_domains you get a truly nasty error; see my post on the distributed crawler
start_requests()     the first requests    can be overridden in your own spider to get past simple anti-crawling measures
custom_settings      per-spider settings   lets each spider carry its own settings overrides
from_crawler         instantiation entry   the first thing executed in the source of every Scrapy component

With a spider we can customise start_requests, give it its own custom_settings, and set request headers, proxies, and cookies. Those basics were covered in earlier posts; the extra thing worth saying here concerns page parsing.

Page parsing comes in two flavours. In the first, the article list returned by the request carries little information in its titles (news and information sites, typically), so you only need to visit the individual articles; in the loop you parse the URL out of the href attribute of each a tag. In the second, typical of e-commerce sites, the product list itself already carries plenty of information: you want both the product data from the list and the user reviews from the detail page, so you loop over the individual li tags and then parse the a tag inside each li to follow it further. (Both patterns are sketched below.)
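A rough sketch of the two patterns; the spider name and all selectors are invented for illustration:

  import scrapy

  class DemoSpider(scrapy.Spider):
      name = 'parse_patterns_demo'
      start_urls = ['https://example.com/list']

      # Pattern 1: a news-style list page; only the links matter, so follow each href.
      def parse(self, response):
          for href in response.css('div.post-list a::attr(href)').extract():
              yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

      # Pattern 2: an e-commerce list page; each <li> already carries data,
      # so extract it here and still follow the inner <a> to the detail page.
      def parse_goods(self, response):
          for li in response.css('ul.goods li'):
              price = li.css('span.price::text').extract_first()
              href = li.css('a::attr(href)').extract_first()
              yield scrapy.Request(response.urljoin(href),
                                   meta={'price': price},          # pass list-page fields along
                                   callback=self.parse_detail)

      def parse_detail(self, response):
          yield {'url': response.url, 'price': response.meta.get('price')}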

scrapy.CrawlSpider

The documentation explains it: essentially you define your own rules, and every requested URL is matched against those regular expressions, complementing allowed_domains; it is typically used to crawl an entire site. You can also do full-site crawls by working out the site's URL patterns yourself: the Sina news crawler and the Dangdang book crawler on this blog both crawl with URL-matching rules I defined by hand. (A minimal CrawlSpider sketch follows.)
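A minimal CrawlSpider sketch of the idea; the domain and URL patterns are made up:

  from scrapy.linkextractors import LinkExtractor
  from scrapy.spiders import CrawlSpider, Rule

  class BookSpider(CrawlSpider):
      name = 'books_demo'
      allowed_domains = ['example.com']
      start_urls = ['http://example.com/']

      rules = (
          # keep following list/pagination pages; no callback needed
          Rule(LinkExtractor(allow=r'/list-\d+\.html'), follow=True),
          # hand detail pages to parse_item
          Rule(LinkExtractor(allow=r'/book/\d+\.html'), callback='parse_item'),
      )

      def parse_item(self, response):
          yield {'url': response.url, 'title': response.css('title::text').extract_first()}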

  1. start_urls

  How it works internally: the Scrapy engine fetches the starting URLs from the spider.
  1. Call start_requests and take its return value.
  2. v = iter(return_value)
  3. req1 = v.__next__()
     req2 = v.__next__()
     req3 = v.__next__()
     ...
  4. Put all the requests into the scheduler.
  1.  
  class ChoutiSpider(scrapy.Spider):
      name = 'chouti'
      allowed_domains = ['chouti.com']
      start_urls = ['https://dig.chouti.com/']
      cookie_dict = {}

      def start_requests(self):
          # Option 1: yield the requests one by one
          for url in self.start_urls:
              yield Request(url=url)
          # Option 2: return a list of requests
          # req_list = []
          # for url in self.start_urls:
          #     req_list.append(Request(url=url))
          # return req_list

  # Customisation: start_requests could also fetch its URLs from redis, for example.

The two ways to implement it

  2. The response:

  # response wraps all the data about the HTTP response:
  - response.text
  - response.encoding
  - response.body
  - response.meta['depth']    # the crawl depth of this response
  - response.request          # the request that produced this response; the request wraps the URL to visit and the callback to run once the download completes

   3. Selectors

  from scrapy.selector import Selector
  from scrapy.http import HtmlResponse

  html = """<!DOCTYPE html>
  <html>
  <head lang="en">
      <meta charset="UTF-8">
      <title></title>
  </head>
  <body>
      <ul>
          <li class="item-"><a id='i1' href="link.html">first item</a></li>
          <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
          <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
      </ul>
      <div><a href="llink2.html">second item</a></div>
  </body>
  </html>
  """
  response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
  # hxs = Selector(response=response).xpath('//a')
  # print(hxs)
  # hxs = Selector(text=html).xpath('//a')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[2]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[@id]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[@id="i1"]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
  # print(hxs)
  # hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
  # print(hxs)
  # hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
  # print(hxs)
  # hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
  # print(hxs)

  # ul_list = Selector(response=response).xpath('//body/ul/li')
  # for item in ul_list:
  #     v = item.xpath('./a/span')
  #     # or
  #     # v = item.xpath('a/span')
  #     # or
  #     # v = item.xpath('*/a/span')
  #     print(v)

response.css('...') returns a SelectorList (selector objects you can keep chaining)
response.css('...').extract() returns a list of strings
response.css('...').extract_first() extracts the first element of that list (or None)
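For example (the selectors assume a page with div.title blocks; purely illustrative):

  node_list  = response.css('div.title a')                              # SelectorList; can keep chaining .css()/.xpath()
  titles     = response.css('div.title a::text').extract()              # list of strings
  first_href = response.css('div.title a::attr(href)').extract_first()  # first match, or None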

  1. def parse_detail(self, response):
  2. # items = JobboleArticleItem()
  3. # title = response.xpath('//div[@class="entry-header"]/h1/text()')[0].extract()
  4. # create_date = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace('·','').strip()
  5. # praise_nums = int(response.xpath("//span[contains(@class,'vote-post-up')]/h10/text()").extract_first())
  6. # fav_nums = response.xpath("//span[contains(@class,'bookmark-btn')]/text()").extract_first()
  7. # try:
  8. # if re.match('.*?(\d+).*', fav_nums).group(1):
  9. # fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
  10. # else:
  11. # fav_nums = 0
  12. # except:
  13. # fav_nums = 0
  14. # comment_nums = response.xpath('//a[contains(@href,"#article-comment")]/span/text()').extract()[0]
  15. # try:
  16. # if re.match('.*?(\d+).*',comment_nums).group(1):
  17. # comment_nums = int(re.match('.*?(\d+).*',comment_nums).group(1))
  18. # else:
  19. # comment_nums = 0
  20. # except:
  21. # comment_nums = 0
  22. # contente = response.xpath('//div[@class="entry"]').extract()[0]
  23. # tag_list = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
  24. # tag_list = [tag for tag in tag_list if not tag.strip().endswith('评论')]
  25. # tags = ",".join(tag_list)
  26. # items['title'] = title
  27. # try:
  28. # create_date = datetime.datetime.strptime(create_date,'%Y/%m/%d').date()
  29. # except:
  30. # create_date = datetime.datetime.now()
  31. # items['date'] = create_date
  32. # items['url'] = response.url
  33. # items['url_object_id'] = get_md5(response.url)
  34. # items['img_url'] = [img_url]
  35. # items['praise_nums'] = praise_nums
  36. # items['fav_nums'] = fav_nums
  37. # items['comment_nums'] = comment_nums
  38. # items['content'] = contente
  39. # items['tags'] = tags

Parsing jobbole with XPath

  1. # title = response.css('.entry-header h1::text')[0].extract()
  2. # create_date = response.css('p.entry-meta-hide-on-mobile::text').extract()[0].strip().replace('·','').strip()
  3. # praise_nums = int(response.css(".vote-post-up h10::text").extract_first()
  4. # fav_nums = response.css(".bookmark-btn::text").extract_first()
  5. # if re.match('.*?(\d+).*', fav_nums).group(1):
  6. # fav_nums = int(re.match('.*?(\d+).*', fav_nums).group(1))
  7. # else:
  8. # fav_nums = 0
  9. # comment_nums = response.css('a[href="#article-comment"] span::text').extract()[0]
  10. # if re.match('.*?(\d+).*', comment_nums).group(1):
  11. # comment_nums = int(re.match('.*?(\d+).*', comment_nums).group(1))
  12. # else:
  13. # comment_nums = 0
  14. # content = response.css('.entry').extract()[0]
  15. # tag_list = response.css('p.entry-meta-hide-on-mobile a::text')
  16. # tag_list = [tag for tag in tag_list if not tag.strip().endswith('评论')]
  17. # tags = ",".join(tag_list)
  18. # xpath选择器 /@href /text()

Parsing jobbole with CSS selectors

  def parse_detail(self, response):
      img_url = response.meta.get('img_url', '')
      item_loader = ArticleItemLoader(item=JobboleArticleItem(), response=response)
      item_loader.add_css("title", ".entry-header h1::text")
      item_loader.add_value('url', response.url)
      item_loader.add_value('url_object_id', get_md5(response.url))
      item_loader.add_css('date', 'p.entry-meta-hide-on-mobile::text')
      item_loader.add_value("img_url", [img_url])
      item_loader.add_css("praise_nums", ".vote-post-up h10::text")
      item_loader.add_css("fav_nums", ".bookmark-btn::text")
      item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
      item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
      item_loader.add_css("content", "div.entry")
      items = item_loader.load_item()
      yield items

The item_loader version

  4. Issuing further requests

  yield Request(url='xxxx', callback=self.parse)
  yield Request(url=parse.urljoin(response.url, post_url), meta={'img_url': img_url}, callback=self.parse_detail)

  5. Carrying cookies

1. settings
Un-comment COOKIES_ENABLED in the settings file and set it to False;
the Cookie you configure in the settings' default request headers is then the one that gets sent.

  COOKIES_ENABLED = False

  # Override the default request headers:
  DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
      'Accept-Encoding': 'gzip, deflate',
      'Accept-Language': 'zh-CN,zh;q=0.9',
      'Cookie': '_ga=GA1.2.1597380101.1571015417; Hm_lvt_2207ecfb7b2633a3bc5c4968feb58569=1571015417; _emuch_index=1; _lastlogin=1; view_tid=11622561; _discuz_uid=13260353; _discuz_pw=bce925aa772ee8f7; last_ip=111.53.196.3_13260353; discuz_tpl=qing; _last_fid=189; _discuz_cc=29751646316289719; Hm_lpvt_2207ecfb7b2633a3bc5c4968feb58569=1571367951; _gat=1'
  }

2. DownloaderMiddleware

Set COOKIES_ENABLED = True in settings,
un-comment DOWNLOADER_MIDDLEWARES in settings,
then open the middlewares file, find the downloader middleware class, and in its process_request add request.cookies = {...}.
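A minimal sketch of that change; the class name is whatever Scrapy generated in your project's middlewares.py, and the cookie values are placeholders:

  class ChoutiDownloaderMiddleware(object):
      def process_request(self, request, spider):
          # attach the cookies to every outgoing request
          request.cookies = {
              '_ga': 'GA1.2.xxxx',          # placeholder values
              'session_id': 'xxxx',
          }
          return None                        # continue through the remaining middlewares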

3. Overriding start_requests in the spider itself

Option 1: carry the cookies explicitly

  cookie_dict = {}
  cookie_jar = CookieJar()
  cookie_jar.extract_cookies(response, response.request)

  # unpack the cookies from the jar into a plain dict
  for k, v in cookie_jar._cookies.items():
      for i, j in v.items():
          for m, n in j.items():
              cookie_dict[m] = n.value

Parsing the cookies

  yield Request(
      url='https://dig.chouti.com/login',
      method='POST',
      body='phone=8615735177116&password=zyf123&oneMonth=1',
      headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
      # cookies=cookie_obj._cookies,
      cookies=self.cookies_dict,
      callback=self.check_login,
  )

Carrying the cookies along

Option 2: meta (a sketch of carrying the session onward follows)

  yield Request(url=url, callback=self.login, meta={'cookiejar': True})
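The CookiesMiddleware keeps one cookie jar per distinct 'cookiejar' meta value, so every follow-up request has to pass the value on. A sketch (assuming `import scrapy`; the form fields are placeholders):

  def login(self, response):
      yield scrapy.FormRequest.from_response(
          response,
          formdata={'phone': 'xxx', 'password': 'xxx'},        # placeholder form fields
          meta={'cookiejar': response.meta['cookiejar']},       # keep using the same cookie jar
          callback=self.check_login,
      )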

  6. Passing values to the callback: meta

  def parse(self, response):
      yield scrapy.Request(url=parse.urljoin(response.url, post_url), meta={'img_url': img_url}, callback=self.parse_detail)

  def parse_detail(self, response):
      img_url = response.meta.get('img_url', '')
  from urllib.parse import urljoin

  import scrapy
  from scrapy import Request
  from scrapy.http.cookies import CookieJar

  class SpiderchoutiSpider(scrapy.Spider):
      name = 'choutilike'
      allowed_domains = ['dig.chouti.com']
      start_urls = ['https://dig.chouti.com/']

      cookies_dict = {}

      def parse(self, response):
          # pull the cookies out of the response headers; they land inside the CookieJar object
          cookie_obj = CookieJar()
          cookie_obj.extract_cookies(response, response.request)

          # unpack the cookies from the jar into a plain dict
          for k, v in cookie_obj._cookies.items():
              for i, j in v.items():
                  for m, n in j.items():
                      self.cookies_dict[m] = n.value

          # self.cookies_dict = cookie_obj._cookies

          yield Request(
              url='https://dig.chouti.com/login',
              method='POST',
              body='phone=8615735177116&password=zyf123&oneMonth=1',
              headers={'content-type': 'application/x-www-form-urlencoded; charset=UTF-8'},
              # cookies=cookie_obj._cookies,
              cookies=self.cookies_dict,
              callback=self.check_login,
          )

      def check_login(self, response):
          # print(response.text)
          yield Request(url='https://dig.chouti.com/all/hot/recent/1',
                        cookies=self.cookies_dict,
                        callback=self.good,
                        )

      def good(self, response):
          id_list = response.css('div.part2::attr(share-linkid)').extract()
          for id in id_list:
              url = 'https://dig.chouti.com/link/vote?linksId={}'.format(id)
              yield Request(
                  url=url,
                  method='POST',
                  cookies=self.cookies_dict,
                  callback=self.show,
              )
          pages = response.css('#dig_lcpage a::attr(href)').extract()
          for page in pages:
              url = urljoin('https://dig.chouti.com/', page)
              yield Request(url=url, callback=self.good)

      def show(self, response):
          print(response.text)

chouti.py: logging in to Chouti automatically and up-voting posts

三. Persistence

   1. Order of work

  • a. Write the pipeline class first
  • b. Then write the Item class
  import scrapy

  class ChoutiItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      title = scrapy.Field()
      href = scrapy.Field()

items.py

  • c. Configure settings

  ITEM_PIPELINES = {
      # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
      # 'scrapy.pipelines.images.ImagesPipeline': 1,
      'chouti.pipelines.ChoutiPipeline': 300,
      # 'chouti.pipelines.Chouti2Pipeline': 301,
  }

ITEM_PIPELINES

  • d. In the spider, every yield of an Item triggers exactly one call to process_item (a short sketch follows).

  yield <Item object>
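A sketch of step d inside the chouti spider (with `from ..items import ChoutiItem`; the selectors match the old chouti.com layout and may well be stale):

  def parse(self, response):
      for new in response.css('.content-list .item'):
          title = new.css('.show-content::text').extract_first('').strip()
          href = new.css('.show-content::attr(href)').extract_first()
          # each yield here results in one process_item call in the pipeline
          yield ChoutiItem(title=title, href=href)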

   2. Writing the pipeline

  1. How the source runs it
  1. Check whether the pipeline class (say, XdbPipeline) defines from_crawler.
     If it does:    obj = XdbPipeline.from_crawler(....)
     If it doesn't: obj = XdbPipeline()
  2. obj.open_spider()
  3. obj.process_item() / obj.process_item() / obj.process_item() / ...  (once per item)
  4. obj.close_spider()
  import scrapy
  from scrapy.pipelines.images import ImagesPipeline
  from scrapy.exceptions import DropItem

  class ChoutiPipeline(object):
      def __init__(self, conn_str):
          self.conn_str = conn_str

      @classmethod
      def from_crawler(cls, crawler):
          """
          Called at start-up to create the pipeline object.
          :param crawler:
          :return:
          """
          conn_str = crawler.settings.get('DB')
          return cls(conn_str)

      def open_spider(self, spider):
          """
          Called when the spider starts.
          :param spider:
          :return:
          """
          self.conn = open(self.conn_str, 'a', encoding='utf-8')

      def process_item(self, item, spider):
          if spider.name == 'spiderchouti':
              self.conn.write('{}\n{}\n'.format(item['title'], item['href']))
          # hand the item on to the next pipeline
          return item
          # to drop the item and keep it away from the next pipeline:
          # raise DropItem()

      def close_spider(self, spider):
          """
          Called when the spider closes.
          :param spider:
          :return:
          """
          self.conn.close()

A pipeline that writes to a file

Note: pipelines are shared by all spiders; if you want behaviour specific to one spider, branch on the spider argument yourself (as the spider.name check above does).

JSON files

  import codecs
  import json

  from scrapy.exporters import JsonItemExporter

  class JsonExporterPipeline(object):
      # use the JsonItemExporter that Scrapy provides to export a json file
      def __init__(self):
          self.file = open('articleexpoter.json', 'wb')
          self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
          self.exporter.start_exporting()  # start exporting

      def close_spider(self, spider):
          self.exporter.finish_exporting()  # stop exporting
          self.file.close()

      def process_item(self, item, spider):
          self.exporter.export_item(item)
          return item

  class JsonWithEncodingPipeline(object):
      # hand-rolled json export
      def __init__(self):
          self.file = codecs.open('article.json', 'w', encoding='utf-8')

      def process_item(self, item, spider):
          lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
          self.file.write(lines)
          return item

      def close_spider(self, spider):
          self.file.close()

pipeline

Storing images

  # -*- coding: utf-8 -*-
  from urllib.parse import urljoin

  import scrapy
  from ..items import XiaohuaItem

  class XiaohuaSpider(scrapy.Spider):
      name = 'xiaohua'
      allowed_domains = ['www.xiaohuar.com']
      start_urls = ['http://www.xiaohuar.com/list-1-{}.html'.format(i) for i in range(11)]

      def parse(self, response):
          items = response.css('.item_list .item')
          for item in items:
              url = item.css('.img img::attr(src)').extract()[0]
              url = urljoin('http://www.xiaohuar.com', url)
              title = item.css('.title span a::text').extract()[0]
              obj = XiaohuaItem(img_url=[url], title=title)
              yield obj

spider

  class XiaohuaItem(scrapy.Item):
      img_url = scrapy.Field()
      title = scrapy.Field()
      img_path = scrapy.Field()

item

  class XiaohuaImagesPipeline(ImagesPipeline):
      # use the ImagesPipeline that Scrapy provides to download the images
      def item_completed(self, results, item, info):
          if "img_url" in item:
              for ok, value in results:
                  print(ok, value)
                  img_path = value['path']
                  item['img_path'] = img_path
          return item

      def get_media_requests(self, item, info):  # download the images
          if "img_url" in item:
              for img_url in item['img_url']:
                  # meta is added so that file_path below can rename the downloaded file
                  yield scrapy.Request(img_url, meta={'item': item, 'index': item['img_url'].index(img_url)})

      def file_path(self, request, response=None, info=None):
          item = request.meta['item']  # the item passed through meta above
          if "img_url" in item:
              index = request.meta['index']  # index of the image currently being downloaded

              # file name: item['title'] plus the extension taken from the URL (jpg, png, ...)
              image_guid = item['title'] + '.' + request.url.split('/')[-1].split('.')[-1]
              filename = u'full/{0}'.format(image_guid)
              return filename

pipeline

  ITEM_PIPELINES = {
      # 'chouti.pipelines.XiaohuaImagesPipeline': 300,
      'scrapy.pipelines.images.ImagesPipeline': 1,
  }

ITEM_PIPELINES

MySQL database

  import pymysql
  from twisted.enterprise import adbapi

  class MysqlPipeline(object):
      def __init__(self):
          self.conn = pymysql.connect('localhost', 'root', '', 'crawed', charset='utf8', use_unicode=True)
          self.cursor = self.conn.cursor()

      def process_item(self, item, spider):
          insert_sql = """insert into article(title,url,create_date,fav_nums) values (%s,%s,%s,%s)"""
          self.cursor.execute(insert_sql, (item['title'], item['url'], item['date'], item['fav_nums']))
          self.conn.commit()

  class MysqlTwistePipeline(object):
      def __init__(self, dbpool):
          self.dbpool = dbpool

      @classmethod
      def from_settings(cls, settings):
          dbparms = dict(
              host=settings['MYSQL_HOST'],
              db=settings['MYSQL_DB'],
              user=settings['MYSQL_USER'],
              password=settings['MYSQL_PASSWORD'],
              charset='utf8',
              cursorclass=pymysql.cursors.DictCursor,
              use_unicode=True,
          )
          dbpool = adbapi.ConnectionPool('pymysql', **dbparms)
          return cls(dbpool)

      def process_item(self, item, spider):
          # use twisted to turn the mysql insert into an asynchronous operation
          query = self.dbpool.runInteraction(self.do_insert, item)
          query.addErrback(self.handle_error)  # handle exceptions

      def handle_error(self, failure):
          # handle exceptions raised by the asynchronous insert
          print(failure)

      def do_insert(self, cursor, item):
          insert_sql, params = item.get_insert_sql()
          try:
              cursor.execute(insert_sql, params)
              print('insert succeeded')
          except Exception as e:
              print('insert failed')

pipeline

  MYSQL_HOST = 'localhost'
  MYSQL_USER = 'root'
  MYSQL_PASSWORD = ''
  MYSQL_DB = 'crawed'

  SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
  SQL_DATE_FORMAT = "%Y-%m-%d"
  RANDOM_UA_TYPE = "random"
  ES_HOST = "127.0.0.1"

settings

四. Deduplication rules

  1. Scrapy's default dedup rule:

  from scrapy.dupefilter import RFPDupeFilter
  1. from __future__ import print_function
  2. import os
  3. import logging
  4.  
  5. from scrapy.utils.job import job_dir
  6. from scrapy.utils.request import request_fingerprint
  7.  
  8. class BaseDupeFilter(object):
  9.  
  10. @classmethod
  11. def from_settings(cls, settings):
  12. return cls()
  13.  
  14. def request_seen(self, request):
  15. return False
  16.  
  17. def open(self): # can return deferred
  18. pass
  19.  
  20. def close(self, reason): # can return a deferred
  21. pass
  22.  
  23. def log(self, request, spider): # log that a request has been filtered
  24. pass
  25.  
  26. class RFPDupeFilter(BaseDupeFilter):
  27. """Request Fingerprint duplicates filter"""
  28.  
  29. def __init__(self, path=None, debug=False):
  30. self.file = None
  31. self.fingerprints = set()
  32. self.logdupes = True
  33. self.debug = debug
  34. self.logger = logging.getLogger(__name__)
  35. if path:
  36. self.file = open(os.path.join(path, 'requests.seen'), 'a+')
  37. self.file.seek(0)
  38. self.fingerprints.update(x.rstrip() for x in self.file)
  39.  
  40. @classmethod
  41. def from_settings(cls, settings):
  42. debug = settings.getbool('DUPEFILTER_DEBUG')
  43. return cls(job_dir(settings), debug)
  44.  
  45. def request_seen(self, request):
  46. fp = self.request_fingerprint(request)
  47. if fp in self.fingerprints:
  48. return True
  49. self.fingerprints.add(fp)
  50. if self.file:
  51. self.file.write(fp + os.linesep)
  52.  
  53. def request_fingerprint(self, request):
  54. return request_fingerprint(request)
  55.  
  56. def close(self, reason):
  57. if self.file:
  58. self.file.close()
  59.  
  60. def log(self, request, spider):
  61. if self.debug:
  62. msg = "Filtered duplicate request: %(request)s"
  63. self.logger.debug(msg, {'request': request}, extra={'spider': spider})
  64. elif self.logdupes:
  65. msg = ("Filtered duplicate request: %(request)s"
  66. " - no more duplicates will be shown"
  67. " (see DUPEFILTER_DEBUG to show all duplicates)")
  68. self.logger.debug(msg, {'request': request}, extra={'spider': spider})
  69. self.logdupes = False
  70.  
  71. spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)

dupefilters

  2. Custom dedup rules
    1. Write the class
  # -*- coding: utf-8 -*-

  """
  @Datetime: 2018/8/31
  @Author: Zhang Yafei
  """
  from scrapy.dupefilter import BaseDupeFilter
  from scrapy.utils.request import request_fingerprint

  class RepeatFilter(BaseDupeFilter):

      def __init__(self):
          self.visited_fd = set()

      @classmethod
      def from_settings(cls, settings):
          return cls()

      def request_seen(self, request):
          fd = request_fingerprint(request=request)
          if fd in self.visited_fd:
              return True
          self.visited_fd.add(fd)

      def open(self):  # can return deferred
          print('open')

      def close(self, reason):  # can return a deferred
          print('close')

      def log(self, request, spider):  # log that a request has been filtered
          pass

dupeFilter.py

 2. Configuration

  # point DUPEFILTER_CLASS at the custom filter (replacing the default rule)
  # DUPEFILTER_CLASS = "chouti.duplication.RepeatFilter"
  DUPEFILTER_CLASS = "chouti.dupeFilter.RepeatFilter"

   3. Using it from the spider

  from urllib.parse import urljoin
  from ..items import ChoutiItem
  import scrapy
  from scrapy.http import Request

  class SpiderchoutiSpider(scrapy.Spider):
      name = 'spiderchouti'
      allowed_domains = ['dig.chouti.com']
      start_urls = ['https://dig.chouti.com/']

      def parse(self, response):
          # grab the titles on the current page
          print(response.request.url)
          # news = response.css('.content-list .item')
          # for new in news:
          #     title = new.css('.show-content::text').extract()[0].strip()
          #     href = new.css('.show-content::attr(href)').extract()[0]
          #     item = ChoutiItem(title=title, href=href)
          #     yield item

          # collect all the page links
          pages = response.css('#dig_lcpage a::attr(href)').extract()
          for page in pages:
              url = urljoin(self.start_urls[0], page)
              # add the newly discovered url to the scheduler
              yield Request(url=url, callback=self.parse)

chouti.py

Notes:

  • make sure the logic inside request_seen is correct
  • dont_filter must be left False (the default) for the filter to apply (example below)
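For example, in a spider:

  # deduplicated by request_seen: a second identical request is silently dropped
  yield Request(url='https://dig.chouti.com/', callback=self.parse)

  # dont_filter=True bypasses the dupefilter entirely, e.g. for a page you must re-visit
  yield Request(url='https://dig.chouti.com/login', callback=self.login, dont_filter=True)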

五. Middleware

  Downloader middleware

  1. from scrapy.http import HtmlResponse
  2. from scrapy.http import Request
  3.  
  4. class Md1(object):
  5. @classmethod
  6. def from_crawler(cls, crawler):
  7. # This method is used by Scrapy to create your spiders.
  8. s = cls()
  9. return s
  10.  
  11. def process_request(self, request, spider):
  12. # Called for each request that goes through the downloader
  13. # middleware.
  14.  
  15. # Must either:
  16. # - return None: continue processing this request
  17. # - or return a Response object
  18. # - or return a Request object
  19. # - or raise IgnoreRequest: process_exception() methods of
  20. # installed downloader middleware will be called
  21. print('md1.process_request',request)
  22. # 1. 返回Response
  23. # import requests
  24. # result = requests.get(request.url)
  25. # return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
  26. # 2. 返回Request
  27. # return Request('https://dig.chouti.com/r/tec/hot/1')
  28.  
  29. # 3. 抛出异常
  30. # from scrapy.exceptions import IgnoreRequest
  31. # raise IgnoreRequest
  32.  
  33. # 4. 对请求进行加工(*)
  34. # request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
  35.  
  36. pass
  37.  
  38. def process_response(self, request, response, spider):
  39. # Called with the response returned from the downloader.
  40.  
  41. # Must either;
  42. # - return a Response object
  43. # - return a Request object
  44. # - or raise IgnoreRequest
  45. print('m1.process_response',request,response)
  46. return response
  47.  
  48. def process_exception(self, request, exception, spider):
  49. # Called when a download handler or a process_request()
  50. # (from other downloader middleware) raises an exception.
  51.  
  52. # Must either:
  53. # - return None: continue processing this exception
  54. # - return a Response object: stops process_exception() chain
  55. # - return a Request object: stops process_exception() chain
  56. pass

downloadmiddleware.py

  1. DOWNLOADER_MIDDLEWARES = {
  2. #'xdb.middlewares.XdbDownloaderMiddleware': 543,
  3. # 'xdb.proxy.XdbProxyMiddleware':751,
  4. 'xdb.md.Md1':666,
  5. 'xdb.md.Md2':667,
  6. }

Configuration

  Typical uses (a random User-Agent sketch follows):
  - user-agent
  - proxies
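A minimal random User-Agent middleware sketch (the UA strings are just examples; remember to register the class in DOWNLOADER_MIDDLEWARES):

  import random

  class RandomUserAgentMiddleware(object):
      USER_AGENTS = [
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
      ]

      def process_request(self, request, spider):
          request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
          return None        # let the request continue down the middleware chain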

Spider middleware

  1. class Sd1(object):
  2. # Not all methods need to be defined. If a method is not defined,
  3. # scrapy acts as if the spider middleware does not modify the
  4. # passed objects.
  5.  
  6. @classmethod
  7. def from_crawler(cls, crawler):
  8. # This method is used by Scrapy to create your spiders.
  9. s = cls()
  10. return s
  11.  
  12. def process_spider_input(self, response, spider):
  13. # Called for each response that goes through the spider
  14. # middleware and into the spider.
  15.  
  16. # Should return None or raise an exception.
  17. return None
  18.  
  19. def process_spider_output(self, response, result, spider):
  20. # Called with the results returned from the Spider, after
  21. # it has processed the response.
  22.  
  23. # Must return an iterable of Request, dict or Item objects.
  24. for i in result:
  25. yield i
  26.  
  27. def process_spider_exception(self, response, exception, spider):
  28. # Called when a spider or process_spider_input() method
  29. # (from other spider middleware) raises an exception.
  30.  
  31. # Should return either None or an iterable of Response, dict
  32. # or Item objects.
  33. pass
  34.  
  35. # 只在爬虫启动时,执行一次。
  36. def process_start_requests(self, start_requests, spider):
  37. # Called with the start requests of the spider, and works
  38. # similarly to the process_spider_output() method, except
  39. # that it doesn’t have a response associated.
  40.  
  41. # Must return only requests (not items).
  42. for r in start_requests:
  43. yield r

spidermiddleware.py

  1. SPIDER_MIDDLEWARES = {
  2. # 'xdb.middlewares.XdbSpiderMiddleware': 543,
  3. 'xdb.sd.Sd1': 666,
  4. 'xdb.sd.Sd2': 667,
  5. }

Configuration

  Typical uses:
  - depth
  - priority
  class SpiderMiddleware(object):

      def process_spider_input(self, response, spider):
          """
          Called once the download has finished, before the response is handed to parse.
          :param response:
          :param spider:
          :return:
          """
          pass

      def process_spider_output(self, response, result, spider):
          """
          Called when the spider has finished processing and returns its results.
          :param response:
          :param result:
          :param spider:
          :return: must return an iterable containing Request or Item objects
          """
          return result

      def process_spider_exception(self, response, exception, spider):
          """
          Called on exceptions.
          :param response:
          :param exception:
          :param spider:
          :return: None to let the remaining middlewares handle the exception, or an iterable containing Response or Item objects to hand to the scheduler or the pipelines
          """
          return None

      def process_start_requests(self, start_requests, spider):
          """
          Called when the spider starts.
          :param start_requests:
          :param spider:
          :return: an iterable of Request objects
          """
          return start_requests

Spider middleware

  class DownMiddleware1(object):
      def process_request(self, request, spider):
          """
          Called for every request about to be downloaded, through all downloader middlewares.
          :param request:
          :param spider:
          :return:
              None: continue to the later middlewares and download the request
              a Response object: stop running process_request and start running process_response
              a Request object: stop the middleware chain and send the Request back to the scheduler
              raise IgnoreRequest: stop running process_request and start running process_exception
          """
          pass

      def process_response(self, request, response, spider):
          """
          Called on the way back, once the download has produced a response.
          :param request:
          :param response:
          :param spider:
          :return:
              a Response object: handed on to the other middlewares' process_response
              a Request object: stop the middleware chain; the request is rescheduled for download
              raise IgnoreRequest: Request.errback is called
          """
          print('response1')
          return response

      def process_exception(self, request, exception, spider):
          """
          Called when a download handler or a process_request() (of a downloader middleware) raises an exception.
          :param request:
          :param exception:
          :param spider:
          :return:
              None: let the remaining middlewares handle the exception
              a Response object: stops the process_exception chain
              a Request object: stop the middleware chain; the request is rescheduled for download
          """
          return None

Downloader middleware

  Setting a proxy

  # Built-in approach 1: before the spider starts, set the proxy in os.environ.
  class ChoutiSpider(scrapy.Spider):
      name = 'chouti'
      allowed_domains = ['chouti.com']
      start_urls = ['https://dig.chouti.com/']
      cookie_dict = {}

      def start_requests(self):
          import os
          os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
          os.environ['HTTP_PROXY'] = '19.11.2.32'
          for url in self.start_urls:
              yield Request(url=url, callback=self.parse)

  # Built-in approach 2: pass the proxy per request through meta.
  class ChoutiSpider(scrapy.Spider):
      name = 'chouti'
      allowed_domains = ['chouti.com']
      start_urls = ['https://dig.chouti.com/']
      cookie_dict = {}

      def start_requests(self):
          for url in self.start_urls:
              yield Request(url=url, callback=self.parse, meta={'proxy': 'http://root:woshiniba@192.168.11.11:9999/'})

The built-in approaches (2 of them)

  1. import base64
  2. import random
  3. from six.moves.urllib.parse import unquote
  4. try:
  5. from urllib2 import _parse_proxy
  6. except ImportError:
  7. from urllib.request import _parse_proxy
  8. from six.moves.urllib.parse import urlunparse
  9. from scrapy.utils.python import to_bytes
  10.  
  11. class XdbProxyMiddleware(object):
  12.  
  13. def _basic_auth_header(self, username, password):
  14. user_pass = to_bytes(
  15. '%s:%s' % (unquote(username), unquote(password)),
  16. encoding='latin-1')
  17. return base64.b64encode(user_pass).strip()
  18.  
  19. def process_request(self, request, spider):
  20. PROXIES = [
  21. "http://root:woshiniba@192.168.11.11:9999/",
  22. "http://root:woshiniba@192.168.11.12:9999/",
  23. "http://root:woshiniba@192.168.11.13:9999/",
  24. "http://root:woshiniba@192.168.11.14:9999/",
  25. "http://root:woshiniba@192.168.11.15:9999/",
  26. "http://root:woshiniba@192.168.11.16:9999/",
  27. ]
  28. url = random.choice(PROXIES)
  29.  
  30. orig_type = ""
  31. proxy_type, user, password, hostport = _parse_proxy(url)
  32. proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))
  33.  
  34. if user:
  35. creds = self._basic_auth_header(user, password)
  36. else:
  37. creds = None
  38. request.meta['proxy'] = proxy_url
  39. if creds:
  40. request.headers['Proxy-Authorization'] = b'Basic ' + creds
  41.  
  42. class DdbProxyMiddleware(object):
  43. def process_request(self, request, spider):
  44. PROXIES = [
  45. {'ip_port': '111.11.228.75:80', 'user_pass': ''},
  46. {'ip_port': '120.198.243.22:80', 'user_pass': ''},
  47. {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
  48. {'ip_port': '101.71.27.120:80', 'user_pass': ''},
  49. {'ip_port': '122.96.59.104:80', 'user_pass': ''},
  50. {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
  51. ]
  52. proxy = random.choice(PROXIES)
  53. if proxy['user_pass'] is not None:
  54. request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
  55. encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
  56. request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
  57. else:
  58. request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

Custom proxy middleware

六. Custom commands

  Running a single spider: main.py

  from scrapy.cmdline import execute
  import sys
  import os

  sys.path.append(os.path.dirname(__file__))

  # execute(['scrapy', 'crawl', 'spiderchouti', '--nolog'])
  # os.system('scrapy crawl spiderchouti')
  # os.system('scrapy crawl xiaohua')
  os.system('scrapy crawl choutilike --nolog')

   Running all spiders:

  1. Create a directory (any name, e.g. commands) at the same level as spiders.
  2. Inside it create crawlall.py (the file name becomes the command name).
  3. Add COMMANDS_MODULE = '<project_name>.<directory_name>' to settings.py.
  4. Run the command from the project directory: scrapy crawlall
  # -*- coding: utf-8 -*-

  """
  @Datetime: 2018/9/1
  @Author: Zhang Yafei
  """
  from scrapy.commands import ScrapyCommand
  from scrapy.utils.project import get_project_settings

  class Command(ScrapyCommand):

      requires_project = True

      def syntax(self):
          return '[options]'

      def short_desc(self):
          return 'Runs all of the spiders'

      def run(self, args, opts):
          print(type(self.crawler_process))
          from scrapy.crawler import CrawlerProcess
          # 1. the CrawlerProcess constructor runs
          # 2. the CrawlerProcess object (which carries the settings) and its spiders:
          #    2.1 a Crawler is created for every spider
          #    2.2 d = Crawler.crawl(...) is executed
          #        d.addBoth(_done)
          #    2.3 CrawlerProcess._active = {d, }
          # 3. dd = defer.DeferredList(self._active)
          #    dd.addBoth(self._stop_reactor)   # self._stop_reactor ==> reactor.stop()
          # 4. reactor.run()

          # find the names of all the spiders
          spider_list = self.crawler_process.spiders.list()
          # spider_list = ['choutilike', 'xiaohua']   # or crawl just the ones you pick
          for name in spider_list:
              self.crawler_process.crawl(name, **opts.__dict__)
          self.crawler_process.start()
crawlall.py

七. Signals

  1. Signals are hook points the framework reserves for you; they let you attach your own behaviour.
    Built-in signals
  # engine started / stopped
  engine_started = object()
  engine_stopped = object()

  # spider opened
  spider_opened = object()
  # spider has gone idle
  spider_idle = object()
  # spider closed
  spider_closed = object()
  # spider raised an exception
  spider_error = object()

  # a request was handed to the scheduler
  request_scheduled = object()
  # a request was dropped
  request_dropped = object()
  # a response was received
  response_received = object()
  # a response finished downloading
  response_downloaded = object()
  # an item was scraped
  item_scraped = object()
  # an item was dropped
  item_dropped = object()

Custom extensions

  from scrapy import signals

  class MyExtend(object):
      def __init__(self, crawler):
          self.crawler = crawler
          # attach our handlers to the hooks:
          # register an action on each signal we care about
          crawler.signals.connect(self.start, signals.engine_started)
          crawler.signals.connect(self.close, signals.engine_stopped)

      @classmethod
      def from_crawler(cls, crawler):
          return cls(crawler)

      def start(self):
          print('signals.engine_started start')

      def close(self):
          print('signals.engine_stopped close')

Extension one

  1. from scrapy import signals
  2.  
  3. class MyExtend(object):
  4. def __init__(self):
  5. pass
  6.  
  7. @classmethod
  8. def from_crawler(cls, crawler):
  9. self = cls()
  10.  
  11. crawler.signals.connect(self.x1, signal=signals.spider_opened)
  12. crawler.signals.connect(self.x2, signal=signals.spider_closed)
  13.  
  14. return self
  15.  
  16. def x1(self, spider):
  17. print('open')
  18.  
  19. def x2(self, spider):
  20. print('close')

Extension two

Configuration

  1. EXTENSIONS = {
  2. # 'scrapy.extensions.telnet.TelnetConsole': None,
  3. 'chouti.extensions.MyExtend':200,
  4. }

八. The settings file

Scrapy's default settings

  1. """
  2. This module contains the default values for all settings used by Scrapy.
  3.  
  4. For more information about these settings you can read the settings
  5. documentation in docs/topics/settings.rst
  6.  
  7. Scrapy developers, if you add a setting here remember to:
  8.  
  9. * add it in alphabetical order
  10. * group similar settings without leaving blank lines
  11. * add its documentation to the available settings documentation
  12. (docs/topics/settings.rst)
  13.  
  14. """
  15.  
  16. import sys
  17. from importlib import import_module
  18. from os.path import join, abspath, dirname
  19.  
  20. import six
  21.  
  22. AJAXCRAWL_ENABLED = False
  23.  
  24. AUTOTHROTTLE_ENABLED = False
  25. AUTOTHROTTLE_DEBUG = False
  26. AUTOTHROTTLE_MAX_DELAY = 60.0
  27. AUTOTHROTTLE_START_DELAY = 5.0
  28. AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
  29.  
  30. BOT_NAME = 'scrapybot'
  31.  
  32. CLOSESPIDER_TIMEOUT = 0
  33. CLOSESPIDER_PAGECOUNT = 0
  34. CLOSESPIDER_ITEMCOUNT = 0
  35. CLOSESPIDER_ERRORCOUNT = 0
  36.  
  37. COMMANDS_MODULE = ''
  38.  
  39. COMPRESSION_ENABLED = True
  40.  
  41. CONCURRENT_ITEMS = 100
  42.  
  43. CONCURRENT_REQUESTS = 16
  44. CONCURRENT_REQUESTS_PER_DOMAIN = 8
  45. CONCURRENT_REQUESTS_PER_IP = 0
  46.  
  47. COOKIES_ENABLED = True
  48. COOKIES_DEBUG = False
  49.  
  50. DEFAULT_ITEM_CLASS = 'scrapy.item.Item'
  51.  
  52. DEFAULT_REQUEST_HEADERS = {
  53. 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  54. 'Accept-Language': 'en',
  55. }
  56.  
  57. DEPTH_LIMIT = 0
  58. DEPTH_STATS = True
  59. DEPTH_PRIORITY = 0
  60.  
  61. DNSCACHE_ENABLED = True
  62. DNSCACHE_SIZE = 10000
  63. DNS_TIMEOUT = 60
  64.  
  65. DOWNLOAD_DELAY = 0
  66.  
  67. DOWNLOAD_HANDLERS = {}
  68. DOWNLOAD_HANDLERS_BASE = {
  69. 'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
  70. 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
  71. 'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
  72. 'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
  73. 's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
  74. 'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
  75. }
  76.  
  77. DOWNLOAD_TIMEOUT = 180 # 3mins
  78.  
  79. DOWNLOAD_MAXSIZE = 1024*1024*1024 # 1024m
  80. DOWNLOAD_WARNSIZE = 32*1024*1024 # 32m
  81.  
  82. DOWNLOAD_FAIL_ON_DATALOSS = True
  83.  
  84. DOWNLOADER = 'scrapy.core.downloader.Downloader'
  85.  
  86. DOWNLOADER_HTTPCLIENTFACTORY = 'scrapy.core.downloader.webclient.ScrapyHTTPClientFactory'
  87. DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
  88. DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform,
  89. # also allowing negotiation
  90.  
  91. DOWNLOADER_MIDDLEWARES = {}
  92.  
  93. DOWNLOADER_MIDDLEWARES_BASE = {
  94. # Engine side
  95. 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
  96. 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
  97. 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
  98. 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
  99. 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
  100. 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
  101. 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
  102. 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
  103. 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
  104. 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
  105. 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
  106. 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
  107. 'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
  108. 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
  109. # Downloader side
  110. }
  111.  
  112. DOWNLOADER_STATS = True
  113.  
  114. DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
  115.  
  116. EDITOR = 'vi'
  117. if sys.platform == 'win32':
  118. EDITOR = '%s -m idlelib.idle'
  119.  
  120. EXTENSIONS = {}
  121.  
  122. EXTENSIONS_BASE = {
  123. 'scrapy.extensions.corestats.CoreStats': 0,
  124. 'scrapy.extensions.telnet.TelnetConsole': 0,
  125. 'scrapy.extensions.memusage.MemoryUsage': 0,
  126. 'scrapy.extensions.memdebug.MemoryDebugger': 0,
  127. 'scrapy.extensions.closespider.CloseSpider': 0,
  128. 'scrapy.extensions.feedexport.FeedExporter': 0,
  129. 'scrapy.extensions.logstats.LogStats': 0,
  130. 'scrapy.extensions.spiderstate.SpiderState': 0,
  131. 'scrapy.extensions.throttle.AutoThrottle': 0,
  132. }
  133.  
  134. FEED_TEMPDIR = None
  135. FEED_URI = None
  136. FEED_URI_PARAMS = None # a function to extend uri arguments
  137. FEED_FORMAT = 'jsonlines'
  138. FEED_STORE_EMPTY = False
  139. FEED_EXPORT_ENCODING = None
  140. FEED_EXPORT_FIELDS = None
  141. FEED_STORAGES = {}
  142. FEED_STORAGES_BASE = {
  143. '': 'scrapy.extensions.feedexport.FileFeedStorage',
  144. 'file': 'scrapy.extensions.feedexport.FileFeedStorage',
  145. 'stdout': 'scrapy.extensions.feedexport.StdoutFeedStorage',
  146. 's3': 'scrapy.extensions.feedexport.S3FeedStorage',
  147. 'ftp': 'scrapy.extensions.feedexport.FTPFeedStorage',
  148. }
  149. FEED_EXPORTERS = {}
  150. FEED_EXPORTERS_BASE = {
  151. 'json': 'scrapy.exporters.JsonItemExporter',
  152. 'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
  153. 'jl': 'scrapy.exporters.JsonLinesItemExporter',
  154. 'csv': 'scrapy.exporters.CsvItemExporter',
  155. 'xml': 'scrapy.exporters.XmlItemExporter',
  156. 'marshal': 'scrapy.exporters.MarshalItemExporter',
  157. 'pickle': 'scrapy.exporters.PickleItemExporter',
  158. }
  159. FEED_EXPORT_INDENT = 0
  160.  
  161. FILES_STORE_S3_ACL = 'private'
  162.  
  163. FTP_USER = 'anonymous'
  164. FTP_PASSWORD = 'guest'
  165. FTP_PASSIVE_MODE = True
  166.  
  167. HTTPCACHE_ENABLED = False
  168. HTTPCACHE_DIR = 'httpcache'
  169. HTTPCACHE_IGNORE_MISSING = False
  170. HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  171. HTTPCACHE_EXPIRATION_SECS = 0
  172. HTTPCACHE_ALWAYS_STORE = False
  173. HTTPCACHE_IGNORE_HTTP_CODES = []
  174. HTTPCACHE_IGNORE_SCHEMES = ['file']
  175. HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = []
  176. HTTPCACHE_DBM_MODULE = 'anydbm' if six.PY2 else 'dbm'
  177. HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
  178. HTTPCACHE_GZIP = False
  179.  
  180. HTTPPROXY_ENABLED = True
  181. HTTPPROXY_AUTH_ENCODING = 'latin-1'
  182.  
  183. IMAGES_STORE_S3_ACL = 'private'
  184.  
  185. ITEM_PROCESSOR = 'scrapy.pipelines.ItemPipelineManager'
  186.  
  187. ITEM_PIPELINES = {}
  188. ITEM_PIPELINES_BASE = {}
  189.  
  190. LOG_ENABLED = True
  191. LOG_ENCODING = 'utf-8'
  192. LOG_FORMATTER = 'scrapy.logformatter.LogFormatter'
  193. LOG_FORMAT = '%(asctime)s [%(name)s] %(levelname)s: %(message)s'
  194. LOG_DATEFORMAT = '%Y-%m-%d %H:%M:%S'
  195. LOG_STDOUT = False
  196. LOG_LEVEL = 'DEBUG'
  197. LOG_FILE = None
  198. LOG_SHORT_NAMES = False
  199.  
  200. SCHEDULER_DEBUG = False
  201.  
  202. LOGSTATS_INTERVAL = 60.0
  203.  
  204. MAIL_HOST = 'localhost'
  205. MAIL_PORT = 25
  206. MAIL_FROM = 'scrapy@localhost'
  207. MAIL_PASS = None
  208. MAIL_USER = None
  209.  
  210. MEMDEBUG_ENABLED = False # enable memory debugging
  211. MEMDEBUG_NOTIFY = [] # send memory debugging report by mail at engine shutdown
  212.  
  213. MEMUSAGE_CHECK_INTERVAL_SECONDS = 60.0
  214. MEMUSAGE_ENABLED = True
  215. MEMUSAGE_LIMIT_MB = 0
  216. MEMUSAGE_NOTIFY_MAIL = []
  217. MEMUSAGE_WARNING_MB = 0
  218.  
  219. METAREFRESH_ENABLED = True
  220. METAREFRESH_MAXDELAY = 100
  221.  
  222. NEWSPIDER_MODULE = ''
  223.  
  224. RANDOMIZE_DOWNLOAD_DELAY = True
  225.  
  226. REACTOR_THREADPOOL_MAXSIZE = 10
  227.  
  228. REDIRECT_ENABLED = True
  229. REDIRECT_MAX_TIMES = 20 # uses Firefox default setting
  230. REDIRECT_PRIORITY_ADJUST = +2
  231.  
  232. REFERER_ENABLED = True
  233. REFERRER_POLICY = 'scrapy.spidermiddlewares.referer.DefaultReferrerPolicy'
  234.  
  235. RETRY_ENABLED = True
  236. RETRY_TIMES = 2 # initial response + 2 retries = 3 requests
  237. RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408]
  238. RETRY_PRIORITY_ADJUST = -1
  239.  
  240. ROBOTSTXT_OBEY = False
  241.  
  242. SCHEDULER = 'scrapy.core.scheduler.Scheduler'
  243. SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
  244. SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
  245. SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'
  246.  
  247. SPIDER_LOADER_CLASS = 'scrapy.spiderloader.SpiderLoader'
  248. SPIDER_LOADER_WARN_ONLY = False
  249.  
  250. SPIDER_MIDDLEWARES = {}
  251.  
  252. SPIDER_MIDDLEWARES_BASE = {
  253. # Engine side
  254. 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
  255. 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
  256. 'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
  257. 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
  258. 'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
  259. # Spider side
  260. }
  261.  
  262. SPIDER_MODULES = []
  263.  
  264. STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
  265. STATS_DUMP = True
  266.  
  267. STATSMAILER_RCPTS = []
  268.  
  269. TEMPLATES_DIR = abspath(join(dirname(__file__), '..', 'templates'))
  270.  
  271. URLLENGTH_LIMIT = 2083
  272.  
  273. USER_AGENT = 'Scrapy/%s (+https://scrapy.org)' % import_module('scrapy').__version__
  274.  
  275. TELNETCONSOLE_ENABLED = 1
  276. TELNETCONSOLE_PORT = [6023, 6073]
  277. TELNETCONSOLE_HOST = '127.0.0.1'
  278.  
  279. SPIDER_CONTRACTS = {}
  280. SPIDER_CONTRACTS_BASE = {
  281. 'scrapy.contracts.default.UrlContract': 1,
  282. 'scrapy.contracts.default.ReturnsContract': 2,
  283. 'scrapy.contracts.default.ScrapesContract': 3,
  284. }

The default settings file

1. Depth and priority

  - Depth
    - starts at 0
    - each time you yield a new request, its depth is the originating request's depth + 1
    - controlled by the setting DEPTH_LIMIT
  - Priority
    - a request's download priority -= depth * DEPTH_PRIORITY
    - configured via DEPTH_PRIORITY

  def parse(self, response):
      # print the current page's URL together with its crawl depth
      print(response.request.url, response.meta.get('depth', 0))

The settings file explained

  1. # -*- coding: utf-8 -*-
  2.  
  3. # Scrapy settings for step8_king project
  4. #
  5. # For simplicity, this file contains only settings considered important or
  6. # commonly used. You can find more settings consulting the documentation:
  7. #
  8. # http://doc.scrapy.org/en/latest/topics/settings.html
  9. # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
  10. # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
  11.  
  12. # 1. 爬虫名称
  13. BOT_NAME = 'step8_king'
  14.  
  15. # 2. 爬虫应用路径
  16. SPIDER_MODULES = ['step8_king.spiders']
  17. NEWSPIDER_MODULE = 'step8_king.spiders'
  18.  
  19. # Crawl responsibly by identifying yourself (and your website) on the user-agent
  20. # 3. 客户端 user-agent请求头
  21. # USER_AGENT = 'step8_king (+http://www.yourdomain.com)'
  22.  
  23. # Obey robots.txt rules
  24. # 4. 禁止爬虫配置
  25. # ROBOTSTXT_OBEY = False
  26.  
  27. # Configure maximum concurrent requests performed by Scrapy (default: 16)
  28. # 5. 并发请求数
  29. # CONCURRENT_REQUESTS = 4
  30.  
  31. # Configure a delay for requests for the same website (default: 0)
  32. # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
  33. # See also autothrottle settings and docs
  34. # 6. 延迟下载秒数
  35. # DOWNLOAD_DELAY = 2
  36.  
  37. # The download delay setting will honor only one of:
  38. # 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名
  39. # CONCURRENT_REQUESTS_PER_DOMAIN = 2
  40. # 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP
  41. # CONCURRENT_REQUESTS_PER_IP = 3
  42.  
  43. # Disable cookies (enabled by default)
  44. # 8. 是否支持cookie,cookiejar进行操作cookie
  45. # COOKIES_ENABLED = True
  46. # COOKIES_DEBUG = True
  47.  
  48. # Disable Telnet Console (enabled by default)
  49. # 9. Telnet用于查看当前爬虫的信息,操作爬虫等...
  50. # 使用telnet ip port ,然后通过命令操作
  51. # engine.pause() 暂停
  52. # engine.unpause() 重启
  53. # TELNETCONSOLE_ENABLED = True
  54. # TELNETCONSOLE_HOST = '127.0.0.1'
  55. # TELNETCONSOLE_PORT = [6023,]
  56.  
  57. # 10. 默认请求头
  58. # Override the default request headers:
  59. # DEFAULT_REQUEST_HEADERS = {
  60. # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  61. # 'Accept-Language': 'en',
  62. # }
  63.  
  64. # Configure item pipelines
  65. # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
  66. # 11. 定义pipeline处理请求
  67. # ITEM_PIPELINES = {
  68. # 'step8_king.pipelines.JsonPipeline': 700,
  69. # 'step8_king.pipelines.FilePipeline': 500,
  70. # }
  71.  
  72. # 12. 自定义扩展,基于信号进行调用
  73. # Enable or disable extensions
  74. # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
  75. # EXTENSIONS = {
  76. # # 'step8_king.extensions.MyExtension': 500,
  77. # }
  78.  
  79. # 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度
  80. # DEPTH_LIMIT = 3
  81.  
  82. # 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo
  83.  
  84. # 后进先出,深度优先
  85. # DEPTH_PRIORITY = 0
  86. # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
  87. # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
  88. # 先进先出,广度优先
  89.  
  90. # DEPTH_PRIORITY = 1
  91. # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
  92. # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
  93.  
  94. # 15. 调度器队列
  95. # SCHEDULER = 'scrapy.core.scheduler.Scheduler'
  96. # from scrapy.core.scheduler import Scheduler
  97.  
  98. # 16. 访问URL去重
  99. # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
  100.  
  101. # Enable and configure the AutoThrottle extension (disabled by default)
  102. # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
  103.  
  104. """
  105. 17. 自动限速算法
  106. from scrapy.contrib.throttle import AutoThrottle
  107. 自动限速设置
  108. 1. 获取最小延迟 DOWNLOAD_DELAY
  109. 2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
  110. 3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
  111. 4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
  112. 5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
  113. target_delay = latency / self.target_concurrency
  114. new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
  115. new_delay = max(target_delay, new_delay)
  116. new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
  117. slot.delay = new_delay
  118. """
  119.  
  120. # 开始自动限速
  121. # AUTOTHROTTLE_ENABLED = True
  122. # The initial download delay
  123. # 初始下载延迟
  124. # AUTOTHROTTLE_START_DELAY = 5
  125. # The maximum download delay to be set in case of high latencies
  126. # 最大下载延迟
  127. # AUTOTHROTTLE_MAX_DELAY = 10
  128. # The average number of requests Scrapy should be sending in parallel to each remote server
  129. # 平均每秒并发数
  130. # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
  131.  
  132. # Enable showing throttling stats for every response received:
  133. # 是否显示
  134. # AUTOTHROTTLE_DEBUG = True
  135.  
  136. # Enable and configure HTTP caching (disabled by default)
  137. # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
  138.  
  139. """
  140. 18. 启用缓存
  141. 目的用于将已经发送的请求或相应缓存下来,以便以后使用
  142.  
  143. from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
  144. from scrapy.extensions.httpcache import DummyPolicy
  145. from scrapy.extensions.httpcache import FilesystemCacheStorage
  146. """
  147. # 是否启用缓存策略
  148. # HTTPCACHE_ENABLED = True
  149.  
  150. # 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可
  151. # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
  152. # 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略
  153. # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
  154.  
  155. # 缓存超时时间
  156. # HTTPCACHE_EXPIRATION_SECS = 0
  157.  
  158. # 缓存保存路径
  159. # HTTPCACHE_DIR = 'httpcache'
  160.  
  161. # 缓存忽略的Http状态码
  162. # HTTPCACHE_IGNORE_HTTP_CODES = []
  163.  
  164. # 缓存存储的插件
  165. # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  166.  
  167. """
  168. 19. Proxies; by default they are picked up from environment variables
  169. from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
  170.  
  171. Option 1: use the built-in HttpProxyMiddleware, which reads os.environ
  172. os.environ
  173. {
  174. http_proxy:http://root:woshiniba@192.168.11.11:9999/
  175. https_proxy:http://192.168.11.11:9999/
  176. }
  177. Option 2: use a custom downloader middleware
  178. import base64, random, six
  179. def to_bytes(text, encoding=None, errors='strict'):
  180. if isinstance(text, bytes):
  181. return text
  182. if not isinstance(text, six.string_types):
  183. raise TypeError('to_bytes must receive a unicode, str or bytes '
  184. 'object, got %s' % type(text).__name__)
  185. if encoding is None:
  186. encoding = 'utf-8'
  187. return text.encode(encoding, errors)
  188.  
  189. class ProxyMiddleware(object):
  190. def process_request(self, request, spider):
  191. PROXIES = [
  192. {'ip_port': '111.11.228.75:80', 'user_pass': ''},
  193. {'ip_port': '120.198.243.22:80', 'user_pass': ''},
  194. {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
  195. {'ip_port': '101.71.27.120:80', 'user_pass': ''},
  196. {'ip_port': '122.96.59.104:80', 'user_pass': ''},
  197. {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
  198. ]
  199. proxy = random.choice(PROXIES)
  200. if proxy['user_pass']:  # only add the auth header when credentials are set
  201. request.meta['proxy'] = "http://%s" % proxy['ip_port']
  202. encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass'])).decode('ascii')
  203. request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
  204. print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
  205. else:
  206. print("**************ProxyMiddleware no pass************" + proxy['ip_port'])
  207. request.meta['proxy'] = "http://%s" % proxy['ip_port']
  208.  
  209. DOWNLOADER_MIDDLEWARES = {
  210. 'step8_king.middlewares.ProxyMiddleware': 500,
  211. }
  212.  
  213. """
  214.  
  215. """
  216. 20. HTTPS access
  217. There are two cases when crawling over HTTPS:
  218. 1. The target site uses a trusted certificate (supported by default)
  219. DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
  220. DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
  221.  
  222. 2. The target site requires a custom client certificate
  223. DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
  224. DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"
  225.  
  226. # https.py
  227. from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
  228. from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)
  229.  
  230. class MySSLFactory(ScrapyClientContextFactory):
  231. def getCertificateOptions(self):
  232. from OpenSSL import crypto
  233. v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
  234. v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
  235. return CertificateOptions(
  236. privateKey=v1, # a PKey object
  237. certificate=v2, # an X509 object
  238. verify=False,
  239. method=getattr(self, 'method', getattr(self, '_ssl_method', None))
  240. )
  241. Other notes:
  242. Related classes
  243. scrapy.core.downloader.handlers.http.HttpDownloadHandler
  244. scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
  245. scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
  246. Related settings
  247. DOWNLOADER_HTTPCLIENTFACTORY
  248. DOWNLOADER_CLIENTCONTEXTFACTORY
  249.  
  250. """
  251.  
  252. """
  253. 21. Spider middleware
  254. class SpiderMiddleware(object):
  255.  
  256. def process_spider_input(self,response, spider):
  257. '''
  258. Called for each downloaded response before it is handed to the spider's parse callback
  259. :param response:
  260. :param spider:
  261. :return:
  262. '''
  263. pass
  264.  
  265. def process_spider_output(self,response, result, spider):
  266. '''
  267. Called with the result returned after the spider has processed the response
  268. :param response:
  269. :param result:
  270. :param spider:
  271. :return: must return an iterable containing Request and/or Item objects
  272. '''
  273. return result
  274.  
  275. def process_spider_exception(self,response, exception, spider):
  276. '''
  277. Called when a spider callback or process_spider_input raises an exception
  278. :param response:
  279. :param exception:
  280. :param spider:
  281. :return: None to pass the exception on to the remaining middlewares; or an iterable of Request/Item objects, which goes to the scheduler or the pipelines
  282. '''
  283. return None
  284.  
  285. def process_start_requests(self,start_requests, spider):
  286. '''
  287. Called with the start requests when the spider starts
  288. :param start_requests:
  289. :param spider:
  290. :return: an iterable of Request objects
  291. '''
  292. return start_requests
  293.  
  294. Built-in spider middlewares:
  295. 'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
  296. 'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
  297. 'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
  298. 'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
  299. 'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,
  300.  
  301. """
  302. # from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
  303. # Enable or disable spider middlewares
  304. # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
  305. SPIDER_MIDDLEWARES = {
  306. # 'step8_king.middlewares.SpiderMiddleware': 543,
  307. }
  308.  
  309. """
  310. 22. Downloader middleware
  311. class DownMiddleware1(object):
  312. def process_request(self, request, spider):
  313. '''
  314. Called for every request that is about to be downloaded, by each downloader middleware's process_request
  315. :param request:
  316. :param spider:
  317. :return:
  318. None: continue with the remaining middlewares and download as usual;
  319. Response object: stop calling process_request and start calling process_response;
  320. Request object: stop the middleware chain and send the Request back to the scheduler;
  321. raise IgnoreRequest: stop calling process_request and start calling process_exception
  322. '''
  323. pass
  324.  
  325. def process_response(self, request, response, spider):
  326. '''
  327. Called with the response returned by the downloader (or by a later middleware)
  328. :param request:
  329. :param response:
  330. :param spider:
  331. :return:
  332. Response object: passed on to the other middlewares' process_response
  333. Request object: stop the middleware chain; the request is rescheduled for download
  334. raise IgnoreRequest: Request.errback is called
  335. '''
  336. print('response1')
  337. return response
  338.  
  339. def process_exception(self, request, exception, spider):
  340. '''
  341. Called when a download handler or a process_request() (downloader middleware) raises an exception
  342. :param request:
  343. :param exception:
  344. :param spider:
  345. :return:
  346. None: pass the exception on to the remaining middlewares;
  347. Response object: stop calling the remaining process_exception methods;
  348. Request object: stop the middleware chain; the request is rescheduled for download
  349. '''
  350. return None
  351.  
  352. Default downloader middlewares
  353. {
  354. 'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
  355. 'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
  356. 'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
  357. 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
  358. 'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
  359. 'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
  360. 'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
  361. 'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
  362. 'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
  363. 'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
  364. 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
  365. 'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
  366. 'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
  367. 'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
  368. }
  369.  
  370. """
  371. # from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
  372. # Enable or disable downloader middlewares
  373. # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
  374. # DOWNLOADER_MIDDLEWARES = {
  375. # 'step8_king.middlewares.DownMiddleware1': 100,
  376. # 'step8_king.middlewares.DownMiddleware2': 500,
  377. # }
  378. # 23. Logging
  379. Scrapy ships with logging support built on Python's standard logging module (a short usage sketch follows this settings listing)
  380.  
  381. Add the following two lines anywhere in settings.py:
  382.  
  383. LOG_FILE = "mySpider.log"
  384. LOG_LEVEL = "INFO"
  385. Scrapy provides five logging levels:
  386.  
  387. CRITICAL - critical errors
  388. ERROR - regular errors
  389. WARNING - warning messages
  390. INFO - informational messages
  391. DEBUG - debugging messages
  392. Logging settings
  393. Logging can be configured through the following settings in settings.py:
  394.  
  395. LOG_ENABLED default: True, enables logging
  396. LOG_ENCODING default: 'utf-8', encoding used for the log
  397. LOG_FILE default: None, file name (in the current directory) where the log output is written
  398. LOG_LEVEL default: 'DEBUG', minimum level to log
  399. LOG_STDOUT default: False; if True, all standard output (and errors) of the process is redirected to the log, e.g. print("hello") will show up in the Scrapy log
  400. settings

settings.py walkthrough
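Item 23 above only covers the settings side of logging; the sketch below shows how a spider would emit messages through the same machinery. It is a minimal illustration added here: the spider name, start URL and extra logger name are placeholders, not part of the original post.

  # A minimal sketch of using Scrapy's logging from inside a spider.
  # The spider name, start URL and logger name are placeholders.
  import logging
  import scrapy

  class LogDemoSpider(scrapy.Spider):
      name = 'log_demo'
      start_urls = ['http://www.bing.com']

      custom_settings = {
          'LOG_FILE': 'mySpider.log',   # same two settings as shown above
          'LOG_LEVEL': 'INFO',
      }

      def parse(self, response):
          # every spider has a built-in logger named after the spider
          self.logger.info('got %s bytes from %s', len(response.body), response.url)
          # messages sent through the plain logging module end up in the same log
          logging.getLogger('mySpider.extra').warning('custom message')

logging usage (sketch)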

九、Distributed crawling with scrapy-redis

 1. Dedup rules with scrapy-redis: backed by a Redis set

  Option 1: fully custom

  1. # dupeFilter.py
  2. from redis import ConnectionPool, Redis
  3. from scrapy.dupefilters import BaseDupeFilter
  4. from scrapy.utils.request import request_fingerprint
  5.  
  6. class RedisFilter(BaseDupeFilter):
  7. def __init__(self):
  8. pool = ConnectionPool(host='127.0.0.1', port=6379)
  9. self.conn = Redis(connection_pool=pool)
  10.  
  11. def request_seen(self, request):
  12. """
  13. Check whether the current request has already been visited
  14. :param request:
  15. :return: True if it has been visited before; False otherwise
  16. """
  17. fd = request_fingerprint(request=request)
  18. # the Redis key can be customized
  19. added = self.conn.sadd('visited_urls', fd)
  20. return added == 0
  21.  
  22. # append to settings.py
  23. DUPEFILTER_CLASS = "lww.dupeFilter.RedisFilter"

Fully custom dedup filter
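A quick, illustrative check of the request_fingerprint call used above (the URLs are made up): two requests that differ only in query-parameter order produce the same fingerprint, which is what lets the Redis set dedup equivalent URLs.

  from scrapy.http import Request
  from scrapy.utils.request import request_fingerprint

  # query-parameter order is normalized before hashing
  r1 = Request('http://www.example.com/page?a=1&b=2')
  r2 = Request('http://www.example.com/page?b=2&a=1')
  print(request_fingerprint(r1) == request_fingerprint(r2))  # True

request_fingerprint behaviour (sketch)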

  Option 2: rely entirely on scrapy-redis

  1. REDIS_HOST = '127.0.0.1' # Redis host
  2. REDIS_PORT = 6379 # Redis port
  3. # REDIS_PARAMS = {'password':'beta'} # Redis connection kwargs; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
  4. REDIS_ENCODING = "utf-8" # Redis encoding; default: 'utf-8'
  5. # REDIS_URL = 'redis://user:pass@hostname:9001' # connection URL (takes precedence over the settings above)
  6.  
  7. DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
  8.  
  9. # replace the default dedup class
  10. DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

Dedup with scrapy-redis

  Option 3: subclass scrapy-redis and customize

  1. # dupeFilter.py
  2. from scrapy_redis import defaults
  3. from scrapy_redis.connection import get_redis_from_settings
  4. from scrapy_redis.dupefilter import RFPDupeFilter
  5.  
  6. class RedisDupeFilter(RFPDupeFilter):
  7. @classmethod
  8. def from_settings(cls, settings):
  9. server = get_redis_from_settings(settings)
  10. # pro: the key can be customized
  11. # con: the spider object is not available here, so customization is limited
  12. key = defaults.DUPEFILTER_KEY % {'timestamp': 'test_scrapy_redis'}
  13. debug = settings.getbool('DUPEFILTER_DEBUG')
  14. return cls(server, key=key, debug=debug)
  15.  
  16. # settings.py
  17. # REDIS_HOST = '127.0.0.1' # Redis host
  18. # REDIS_PORT = 6379 # Redis port
  19. # REDIS_PARAMS = {'password':'0000'} # Redis connection kwargs; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
  20. # REDIS_ENCODING = "utf-8"
  21. # DUPEFILTER_CLASS = 'lww.dupeFilter.RedisDupeFilter'

Subclassing scrapy-redis for a custom dedup filter

 2. The scheduler

  Option 4: swap in the scrapy-redis scheduler

  • settings configuration
  1. Redis connection settings:
  2. REDIS_HOST = '127.0.0.1' # Redis host
  3. REDIS_PORT = 6073 # Redis port
  4. # REDIS_PARAMS = {'password':'xxx'} # Redis connection kwargs; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
  5. REDIS_ENCODING = "utf-8" # Redis encoding; default: 'utf-8'
  6.  
  7. Dedup settings:
  8. DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
  9.  
  10. Scheduler settings:
  11. SCHEDULER = "scrapy_redis.scheduler.Scheduler"
  12.  
  13. DEPTH_PRIORITY = 1 # breadth-first
  14. # DEPTH_PRIORITY = -1 # depth-first
  15. SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue' # priority queue (the default); alternatives: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
  16.  
  17. # breadth-first
  18. # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
  19. # depth-first
  20. # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
  21. SCHEDULER_QUEUE_KEY = '%(spider)s:requests' # Redis key under which the scheduler stores pending requests
  22.  
  23. SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" # serializer for data stored in Redis; pickle by default
  24.  
  25. SCHEDULER_PERSIST = True # keep the scheduler queue and dedup records on close? True = keep, False = flush
  26. SCHEDULER_FLUSH_ON_START = False # flush the scheduler queue and dedup records on start? True = flush, False = keep
  27. # SCHEDULER_IDLE_BEFORE_CLOSE = 10 # how long (seconds) to block when popping from an empty scheduler queue before giving up
  28.  
  29. SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter' # Redis key under which the dedup records are stored
  30.  
  31. # DUPEFILTER_CLASS takes precedence; SCHEDULER_DUPEFILTER_CLASS is used only if it is not set
  32. # SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter' # class implementing the dedup rules
  • Execution flow
  1. 1. scrapy crawl chouti --nolog
  2.  
  3. 2. Find the SCHEDULER = "scrapy_redis.scheduler.Scheduler" setting and instantiate the scheduler
  4. - Scheduler.from_crawler is called
  5. - which calls Scheduler.from_settings
  6. - read from the settings:
  7. SCHEDULER_PERSIST # keep the scheduler queue and dedup records on close? True = keep, False = flush
  8. SCHEDULER_FLUSH_ON_START # flush the scheduler queue and dedup records on start? True = flush, False = keep
  9. SCHEDULER_IDLE_BEFORE_CLOSE # how long to block when popping from an empty queue before giving up
  10. - read from the settings:
  11. SCHEDULER_QUEUE_KEY # %(spider)s:requests
  12. SCHEDULER_QUEUE_CLASS # scrapy_redis.queue.FifoQueue
  13. SCHEDULER_DUPEFILTER_KEY # '%(spider)s:dupefilter'
  14. DUPEFILTER_CLASS # 'scrapy_redis.dupefilter.RFPDupeFilter'
  15. SCHEDULER_SERIALIZER # "scrapy_redis.picklecompat"
  16.  
  17. - read from the settings:
  18. REDIS_HOST = '140.143.227.206' # Redis host
  19. REDIS_PORT = 8888 # Redis port
  20. REDIS_PARAMS = {'password':'beta'} # Redis connection kwargs; default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
  21. REDIS_ENCODING = "utf-8"
  22. - instantiate the Scheduler object
  23.  
  24. 3. The spider starts executing its start URLs
  25. - scheduler.enqueue_request() is called:
  26. def enqueue_request(self, request):
  27. # does this request need to be filtered?
  28. # is it already in the dedup records? (i.e. has it been visited; if not, it is added to the records)
  29. if not request.dont_filter and self.df.request_seen(request):
  30. # filtering is enabled and the request has been seen before: return False
  31. self.df.log(request, self.spider)
  32. # already visited, do not visit it again
  33. return False
  34.  
  35. if self.stats:
  36. self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
  37. # print('not seen before, pushed to the scheduler', request)
  38. self.queue.push(request)
  39. return True
  40.  
  41. 4. The downloader asks the scheduler for the next task to download
  42.  
  43. - scheduler.next_request() is called:
  44. def next_request(self):
  45. block_pop_timeout = self.idle_before_close
  46. request = self.queue.pop(block_pop_timeout)
  47. if request and self.stats:
  48. self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
  49. return request

Scheduler execution flow

      Note: the scheduler's enqueue_request() is first driven by the requests produced by start_requests, so the spider must provide start requests (via start_urls or its own start_requests method); a minimal sketch follows.
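A minimal sketch of such a spider, assuming the Option 4 settings above are in place; the name, domain and URL follow the chouti example used elsewhere in the post, and the parse body is purely illustrative.

  import scrapy

  class ChoutiSpider(scrapy.Spider):
      name = 'chouti'
      allowed_domains = ['dig.chouti.com']

      def start_requests(self):
          # every request yielded here goes through scheduler.enqueue_request(),
          # i.e. into the Redis queue after passing the Redis dupefilter
          yield scrapy.Request('https://dig.chouti.com/', callback=self.parse)

      def parse(self, response):
          self.logger.info('downloaded %s', response.url)

explicit start_requests (sketch)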

  • Item persistence
  1. When the spider yields an Item, RedisPipeline persists it to Redis (see the pipeline sketch after the settings listing below)
  2.  
  3. a. When persisting items to Redis, the key and the serializer can be configured:
  4.  
  5. REDIS_ITEMS_KEY = '%(spider)s:items'
  6. REDIS_ITEMS_SERIALIZER = 'json.dumps'
  7.  
  8. b. Items are stored in a Redis list
  • Start URLs
  1. """
  2. Start-URL related settings
  3.  
  4. a. When fetching start URLs, read from a set or from a list? True = set; False = list
  5. REDIS_START_URLS_AS_SET = False # if True, start URLs are fetched with self.server.spop; if False, with self.server.lpop
  6. b. The spider reads its start URLs from this Redis key
  7. REDIS_START_URLS_KEY = '%(name)s:start_urls'
  8.  
  9. """
  10. # If True, it uses redis' ``spop`` operation. This could be useful if you
  11. # want to avoid duplicates in your start urls list. In this cases, urls must
  12. # be added via ``sadd`` command or you will get a type error from redis.
  13. # REDIS_START_URLS_AS_SET = False
  14.  
  15. # Default start urls key for RedisSpider and RedisCrawlSpider.
  16. # REDIS_START_URLS_KEY = '%(name)s:start_urls'

settings configuration
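The persistence notes above mention RedisPipeline but not how to enable it; a minimal settings sketch, assuming the scrapy-redis defaults, would be (the priority value 300 is an arbitrary choice):

  # settings.py
  ITEM_PIPELINES = {
      'scrapy_redis.pipelines.RedisPipeline': 300,
  }

  # every yielded item is appended to a Redis list under this key,
  # serialized with the function named below
  REDIS_ITEMS_KEY = '%(spider)s:items'
  REDIS_ITEMS_SERIALIZER = 'json.dumps'

RedisPipeline settings (sketch)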

  1. from scrapy_redis.spiders import RedisSpider
  2.  
  3. class SpiderchoutiSpider(RedisSpider):
  4. name = 'spiderchouti'
  5. allowed_domains = ['dig.chouti.com']
  6. # no start_urls here; they are read from Redis

A spider class inheriting from RedisSpider
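The snippet above omits the callback; a sketch of what it could look like (the selector and the yielded fields are illustrative, not from the original post):

  from scrapy_redis.spiders import RedisSpider

  class SpiderchoutiSpider(RedisSpider):
      name = 'spiderchouti'
      allowed_domains = ['dig.chouti.com']

      def parse(self, response):
          # runs for every URL popped from the spiderchouti:start_urls key
          for href in response.css('a::attr(href)').extract():
              yield {'url': response.urljoin(href)}

RedisSpider with a parse callback (sketch)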

  1. from redis import Redis, ConnectionPool
  2.  
  3. pool = ConnectionPool(host='127.0.0.1', port=6379)
  4. conn = Redis(connection_pool=pool)
  5.  
  6. conn.lpush('spiderchouti:start_urls','https://dig.chouti.com/')

Pushing to the spider_name:start_urls list in Redis
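After a run it can be useful to peek at what scrapy-redis left in Redis; the sketch below assumes the key patterns configured above (%(spider)s:requests, %(spider)s:dupefilter, %(spider)s:items) and the default PriorityQueue (a sorted set).

  from redis import Redis, ConnectionPool

  pool = ConnectionPool(host='127.0.0.1', port=6379)
  conn = Redis(connection_pool=pool)

  print(conn.llen('spiderchouti:start_urls'))     # start URLs not yet consumed (list)
  print(conn.scard('spiderchouti:dupefilter'))    # fingerprints of seen requests (set)
  print(conn.zcard('spiderchouti:requests'))      # pending requests (sorted set)
  print(conn.lrange('spiderchouti:items', 0, 4))  # first few persisted items (list)

inspecting the Redis keys (sketch)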

  • A few notes
  1. 1. What is depth-first? What is breadth-first?
  2. Think of a tree: depth-first finishes all nodes of one subtree before moving on to the next subtree; breadth-first finishes one level before moving on to the next level
  3. 2. How does Scrapy implement depth-first and breadth-first crawling?
  4. With a stack or a queue (see the ordering sketch after this list):
  5. first in, first out: breadth-first
  6. last in, first out: depth-first
  7. With a sorted set:
  8. priority queue:
  9. DEPTH_PRIORITY = 1 # breadth-first
  10. DEPTH_PRIORITY = -1 # depth-first
  11.  
  12. 3. How do the Scrapy scheduler, the queue and the dupefilter relate?
  13.  
  14. The scheduler decides which request is added or fetched next.
  15. The queue stores requests: FIFO gives breadth-first, LIFO gives depth-first, or a priority queue is used.
  16. The dupefilter keeps the visit records, i.e. the dedup rules.

Notes
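A tiny, framework-free sketch of the point above: popping from one end of a queue (FIFO) walks the link tree level by level, while popping from the other end (LIFO) dives down one branch first. The two-level link map is made up for illustration.

  from collections import deque

  links = {'/': ['/a', '/b'], '/a': ['/a/1'], '/b': ['/b/1'], '/a/1': [], '/b/1': []}

  def crawl(order):
      seen, queue = [], deque(['/'])
      while queue:
          page = queue.popleft() if order == 'fifo' else queue.pop()
          seen.append(page)
          queue.extend(links[page])
      return seen

  print(crawl('fifo'))  # ['/', '/a', '/b', '/a/1', '/b/1']  -> breadth-first
  print(crawl('lifo'))  # ['/', '/b', '/b/1', '/a', '/a/1']  -> depth-first

FIFO vs LIFO ordering (sketch)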

十、TinyScrapy

  1. from twisted.internet import reactor # event loop (stops once every socket has been removed)
  2. from twisted.web.client import getPage # socket object (removed from the event loop automatically once the download finishes)
  3. from twisted.internet import defer # defer.Deferred: a special "socket" that sends no request and has to be removed manually
  4. from queue import Queue
  5.  
  6. class Request(object):
  7. """
  8. Wraps the information of a request; spiders use it to issue requests
  9. """
  10. def __init__(self,url,callback):
  11. self.url = url
  12. self.callback = callback
  13.  
  14. class HttpResponse(object):
  15. """
  16. Wraps the downloaded data together with the Request object that produced it into a single response object,
  17. so the callback gets more than the raw content: it also has convenient attributes such as the request, the URL, the decoded text, etc.,
  18. which makes it easier for the callback to extract the useful data
  19. """
  20. def __init__(self,content,request):
  21. self.content = content
  22. self.request = request
  23. self.url = request.url
  24. self.text = str(content,encoding='utf-8')
  25.  
  26. class Scheduler(object):
  27. """
  28. Task scheduler:
  29. 1. initializes a queue
  30. 2. next_request: reads the next request from the queue
  31. 3. enqueue_request: puts a request into the queue
  32. 4. size: returns the number of requests currently queued
  33. 5. open: does nothing and returns None; it only gives the engine's @defer.inlineCallbacks-decorated open_spider something to yield
  34. """
  35. def __init__(self):
  36. self.q = Queue()
  37.  
  38. def open(self):
  39. pass
  40.  
  41. def next_request(self):
  42. try:
  43. req = self.q.get(block=False)
  44. except Exception as e:
  45. req = None
  46. return req
  47.  
  48. def enqueue_request(self,req):
  49. self.q.put(req)
  50.  
  51. def size(self):
  52. return self.q.qsize()
  53.  
  54. class ExecutionEngine(object):
  55. """
  56. Engine: drives all scheduling
  57. 1. open_spider puts every request from start_requests into the scheduler's queue
  58. 2. get_response_callback handles each response's callback, and _next_request schedules the next request
  59. """
  60. def __init__(self):
  61. self._close = None
  62. self.scheduler = None
  63. self.max = 5
  64. self.crawlling = []
  65.  
  66. def get_response_callback(self,content,request):
  67. self.crawlling.remove(request)
  68. response = HttpResponse(content,request)
  69. result = request.callback(response)
  70. import types
  71. if isinstance(result,types.GeneratorType):
  72. for req in result:
  73. self.scheduler.enqueue_request(req)
  74.  
  75. def _next_request(self):
  76. """
  77. 1. Schedules the spider's requests
  78. 2. Termination condition for the event loop: the scheduler queue is empty and no request is in flight
  79. 3. Enforces the maximum concurrency: while the number of in-flight requests is below the limit, pull requests from the scheduler queue
  80. 4. Downloads each request and runs its callback on the returned data
  81. 5. Then schedules the next round of the event loop
  82. """
  83. if self.scheduler.size() == 0 and len(self.crawlling) == 0:
  84. self._close.callback(None)
  85. return
  86. # cap the number of in-flight requests at self.max (5)
  87. while len(self.crawlling) < self.max:
  88. req = self.scheduler.next_request()
  89. if not req:
  90. return
  91. self.crawlling.append(req)
  92. d = getPage(req.url.encode('utf-8'))
  93. d.addCallback(self.get_response_callback,req)
  94. d.addCallback(lambda _:reactor.callLater(0,self._next_request))
  95.  
  96. @defer.inlineCallbacks
  97. def open_spider(self,start_requests):
  98. """
  99. 1. Creates a scheduler object
  100. 2. Puts every request from start_requests into the scheduler queue
  101. 3. Then kicks off the event loop to schedule the next request
  102. Note: every function decorated with @defer.inlineCallbacks must yield something, even if it is None
  103. """
  104. self.scheduler = Scheduler()
  105. yield self.scheduler.open()
  106. while True:
  107. try:
  108. req = next(start_requests)
  109. except StopIteration as e:
  110. break
  111. self.scheduler.enqueue_request(req)
  112. reactor.callLater(0,self._next_request)
  113.  
  114. @defer.inlineCallbacks
  115. def start(self):
  116. """Sends no request and must be fired manually; its only purpose is to keep the event loop alive"""
  117. self._close = defer.Deferred()
  118. yield self._close
  119.  
  120. class Crawler(object):
  121. """
  122. 1. Encapsulates the scheduler and the engine
  123. 2. Creates the spider object from the spider class path passed in
  124. 3. Creates an engine to open the spider, puts every request from the spider into the scheduler queue, and lets the engine schedule the requests
  125. """
  126. def _create_engine(self):
  127. return ExecutionEngine()
  128.  
  129. def _create_spider(self,spider_cls_path):
  130. """
  131.  
  132. :param spider_cls_path: spider.chouti.ChoutiSpider
  133. :return:
  134. """
  135. module_path,cls_name = spider_cls_path.rsplit('.',maxsplit=1)
  136. import importlib
  137. m = importlib.import_module(module_path)
  138. cls = getattr(m,cls_name)
  139. return cls()
  140.  
  141. @defer.inlineCallbacks
  142. def crawl(self,spider_cls_path):
  143. engine = self._create_engine()
  144. spider = self._create_spider(spider_cls_path)
  145. start_requests = iter(spider.start_requests())
  146. yield engine.open_spider(start_requests) # put every request from start_requests into the scheduler queue and let the engine schedule them
  147. yield engine.start() # create a Deferred that keeps the event loop alive until it is fired manually
  148.  
  149. class CrawlerProcess(object):
  150. """
  151. 1. Creates a Crawler object for each spider
  152. 2. Passes each spider class path to Crawler.crawl
  153. 3. Adds the returned Deferred to a set
  154. 4. Starts the event loop
  155. """
  156. def __init__(self):
  157. self._active = set()
  158.  
  159. def crawl(self,spider_cls_path):
  160. """
  161. :param spider_cls_path:
  162. :return:
  163. """
  164. crawler = Crawler()
  165. d = crawler.crawl(spider_cls_path)
  166. self._active.add(d)
  167.  
  168. def start(self):
  169. dd = defer.DeferredList(self._active)
  170. dd.addBoth(lambda _:reactor.stop())
  171.  
  172. reactor.run()
  173.  
  174. class Command(object):
  175. """
  176. 1. The entry point that starts the run
  177. 2. Passes each spider class path to crawl_process.crawl
  178. 3. crawl_process.crawl creates a Crawler, which in turn creates an engine and the spider object via Crawler.crawl
  179. 4. The engine's open_spider creates a scheduler, puts each spider request into the scheduler queue, and drives the next request via _next_request
  180. 5. Finally, crawl_process.start runs the reactor until all crawls finish
  181. """
  182. def run(self):
  183. crawl_process = CrawlerProcess()
  184. spider_cls_path_list = ['spider.chouti.ChoutiSpider','spider.cnblogs.CnblogsSpider',]
  185. for spider_cls_path in spider_cls_path_list:
  186. crawl_process.crawl(spider_cls_path)
  187. crawl_process.start()
  188.  
  189. if __name__ == '__main__':
  190. cmd = Command()
  191. cmd.run()

tinyscrapy
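The Command class above imports spider.chouti.ChoutiSpider and spider.cnblogs.CnblogsSpider, but the post does not show those modules. Below is a minimal sketch of what spider/chouti.py could look like; the module name engine for the TinyScrapy code, the URL and the parse body are assumptions made for illustration.

  # spider/chouti.py -- a spider for the TinyScrapy framework above.
  # TinyScrapy only needs start_requests() plus callbacks that may yield more Requests.
  from engine import Request  # assuming the TinyScrapy code above is saved as engine.py

  class ChoutiSpider(object):
      name = 'chouti'

      def start_requests(self):
          yield Request(url='https://dig.chouti.com/', callback=self.parse)

      def parse(self, response):
          # response is the HttpResponse defined above: .text, .url, .request
          print(response.url, len(response.text))
          # yielding another Request here would push it back into the Scheduler queue

spider/chouti.py (sketch)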
