1 Introduction

Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping), letting you extract the data you need from websites in a fast, simple, extensible way. Today Scrapy's uses are much broader: it is applied to data mining, monitoring, and automated testing, to fetching data returned by APIs (such as Amazon Associates Web Services), and as a general-purpose web crawler.

Scrapy is built on top of Twisted, a popular event-driven Python networking framework, so Scrapy uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows.

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then sends processed Requests to the Scheduler and asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.
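The nine steps above amount to a loop over a scheduler queue. A much-simplified sketch of that loop (Scrapy's real engine is asynchronous and event-driven; this toy version is synchronous, with URLs standing in for Request objects):

```python
from collections import deque

def crawl(start_requests, download, parse):
    """Toy version of the Scrapy data flow: the engine pulls requests from
    the scheduler, downloads them, hands responses to the spider, and routes
    new requests back to the scheduler and items to the pipelines."""
    scheduler = deque(start_requests)      # step 2: engine schedules the requests
    seen, items = set(), []
    while scheduler:                       # step 9: repeat until the scheduler is empty
        url = scheduler.popleft()          # step 3: scheduler returns the next request
        if url in seen:
            continue                       # the scheduler also removes duplicates
        seen.add(url)
        response = download(url)           # steps 4-5: the downloader fetches the page
        for result in parse(response):     # steps 6-7: the spider parses the response
            if isinstance(result, str):    # a new request to follow (here just a URL)
                scheduler.append(result)   # step 8: back to the scheduler
            else:
                items.append(result)       # step 8: items go to the item pipelines
    return items
```

With a fake `download` and `parse`, `crawl(['page1'], download, parse)` walks the queue exactly in the order the steps describe.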

Components:

  1. Engine
    The engine is responsible for controlling the data flow between all components of the system, and for triggering events when certain actions occur. See the data flow section above for details.
  2. Scheduler
    Receives requests from the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next, and it removes duplicate URLs.
  3. Downloader
    Downloads page content and hands it back to the engine. The downloader is built on top of Twisted's efficient asynchronous model.
  4. Spiders
    Spiders are classes written by the developer to parse responses, extract items, or send new requests.
  5. Item Pipelines
    Responsible for processing items once they have been extracted: mainly cleaning, validation, and persistence (e.g. saving to a database).
  6. Downloader Middlewares
    Sit between the engine and the downloader, processing requests passed from the engine to the downloader and responses passed from the downloader back to the engine. You can use this middleware to:
    1. process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
    2. change a received response before passing it to a spider;
    3. send a new Request instead of passing a received response to a spider;
    4. pass a response to a spider without fetching a web page;
    5. silently drop some requests.
  7. Spider Middlewares
    Sit between the engine and the spiders, processing the spiders' input (responses) and output (requests and items).

Official docs: https://docs.scrapy.org/en/latest/topics/architecture.html

2 Installation

  #Windows
  1. pip3 install wheel   # enables installing packages from .whl files; wheel files: https://www.lfd.uci.edu/~gohlke/pythonlibs
  2. pip3 install lxml
  3. pip3 install pyopenssl
  4. download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
  5. download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
  6. run: pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
  7. pip3 install scrapy

  #Linux
  pip3 install scrapy

3 Command-line tool

  #1 View help
  scrapy -h
  scrapy <command> -h

  #2 There are two kinds of commands: Project-only commands must be run inside a project directory, Global commands can be run anywhere
  Global commands:
  startproject   # create a project: scrapy startproject <project name>
  genspider      # create a spider: scrapy genspider <spider name> <url>
  settings       # if run inside a project directory, shows that project's settings
  runspider      # run a standalone Python spider file, without creating a project
  shell          # scrapy shell <url>: interactive debugging, e.g. checking whether selector rules are correct
  fetch          # fetch a single page independently of the project; you can inspect the request headers
  view           # download a page and open it in the browser, to tell which data is loaded via ajax
  version        # scrapy version shows Scrapy's version; scrapy version -v also shows dependency versions
  Project-only commands:
  crawl          # run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in the settings
  check          # check the project for syntax errors
  list           # list the spiders in the project
  edit           # open a spider in an editor; rarely used
  parse          # scrapy parse <url> --callback <callback>: verify that a callback works correctly
  bench          # scrapy bench: benchmark/stress test

  #3 Official docs
  https://docs.scrapy.org/en/latest/topics/commands.html
  #1 Global commands: make sure you are NOT inside a project directory, to avoid being affected by that project's settings
  scrapy startproject MyProject

  cd MyProject
  scrapy genspider baidu www.baidu.com

  scrapy settings --get XXX   # inside a project directory this shows that project's settings

  scrapy runspider baidu.py

  scrapy shell https://www.baidu.com
      response
      response.status
      response.body
      view(response)

  scrapy view https://www.taobao.com   # if the page displays incomplete content, the missing parts are loaded via ajax; a quick way to locate the problem

  scrapy fetch --nolog --headers https://www.taobao.com

  scrapy version      # Scrapy's version

  scrapy version -v   # dependency versions

  #2 Project commands: cd into the project directory first
  scrapy crawl baidu
  scrapy check
  scrapy list
  scrapy parse http://quotes.toscrape.com/ --callback parse
  scrapy bench

Example usage

4 Project structure and a brief intro to spider applications

  project_name/
      scrapy.cfg
      project_name/
          __init__.py
          items.py
          pipelines.py
          settings.py
          spiders/
              __init__.py
              spider1.py
              spider2.py
              spider3.py

File overview:

  • scrapy.cfg   the project's main configuration, used when deploying Scrapy; spider-related settings live in settings.py.
  • items.py     data-storage templates for structured data, similar to Django's Model
  • pipelines    data-processing behavior, e.g. persisting structured data
  • settings.py  the configuration file: recursion depth, concurrency, download delay, etc. Note: setting names must be UPPERCASE or they are ignored; correct form: USER_AGENT='xxxx'
  • spiders      the spiders directory: create files here and write the crawling rules

Note: spider files are usually named after the website's domain.

  # create entrypoint.py in the project directory
  from scrapy.cmdline import execute
  execute(['scrapy', 'crawl', 'xiaohua'])

By default spiders can only be run from the command line; to run them from PyCharm, use a file like the one above.

  import sys, io
  sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

About Windows console encoding

5 Spiders

1. Introduction

  #1 Spiders are a collection of classes that define which site (or group of sites) will be crawled, including how to perform the crawl and how to extract structured data from the pages.

  #2 In other words, Spiders are where you define the custom crawling and parsing behavior for a particular site (or group of sites).

2. The cycle a Spider goes through

  #1 Generate the initial Requests to crawl the first URLs, and specify a callback function
  The first requests are defined in start_requests(), which by default builds Requests from the URLs in the start_urls list, with the parse method as the default callback. The callback is triggered automatically when the download finishes and a response comes back.

  #2 In the callback, parse the response and return values
  Four kinds of return value are possible:
      a dict containing the parsed data
      an Item object
      a new Request object (new Requests also need a callback)
      or an iterable containing Items and/or Requests

  #3 Parse the page content in the callback
  Typically with Scrapy's built-in Selectors, but you can just as well use BeautifulSoup, lxml, or whatever you prefer.

  #4 Finally, the returned Items are persisted to a database
  via the Item Pipeline (https://docs.scrapy.org/en/latest/topics/item-pipeline.html#topics-item-pipeline)
  or exported to files via Feed exports (https://docs.scrapy.org/en/latest/topics/feed-exports.html#topics-feed-exports)

3. Spiders provides five classes in total:

  #1 scrapy.spiders.Spider       # scrapy.Spider is an alias of scrapy.spiders.Spider
  #2 scrapy.spiders.CrawlSpider
  #3 scrapy.spiders.XMLFeedSpider
  #4 scrapy.spiders.CSVFeedSpider
  #5 scrapy.spiders.SitemapSpider

4. Importing and usage

  # -*- coding: utf-8 -*-
  import scrapy
  from scrapy.spiders import Spider, CrawlSpider, XMLFeedSpider, CSVFeedSpider, SitemapSpider

  class AmazonSpider(scrapy.Spider):   # custom class, inheriting one of the Spider base classes
      name = 'amazon'
      allowed_domains = ['www.amazon.cn']
      start_urls = ['http://www.amazon.cn/']

      def parse(self, response):
          pass

5. class scrapy.spiders.Spider

This is the simplest spider class, and the one every other spider class (including your own) must inherit from.

It provides no special functionality, just a default start_requests method that reads URLs from start_urls to send requests, with parse as the default callback.

  class AmazonSpider(scrapy.Spider):
      name = 'amazon'

      allowed_domains = ['www.amazon.cn']

      start_urls = ['http://www.amazon.cn/']

      custom_settings = {
          'BOT_NAME': 'Egon_Spider_Amazon',
          'DEFAULT_REQUEST_HEADERS': {
              'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
              'Accept-Language': 'en',
          }
      }

      def parse(self, response):
          pass
  #1 name = 'amazon'
  Defines the spider's name; Scrapy uses it to locate the spider,
  so it is required and must be unique (in Python 2 this must be ASCII only).

  #2 allowed_domains = ['www.amazon.cn']
  Defines the domains the spider is allowed to crawl. If OffsiteMiddleware is enabled (it is by default),
  domains not in this list, and their subdomains, will not be crawled.
  If the URL being crawled is https://www.example.com/1.html, add 'example.com' to the list.

  #3 start_urls = ['http://www.amazon.cn/']
  If no specific URLs are given, the first requests are generated from this list.

  #4 custom_settings
  A dict of settings that override the project-level settings when this spider runs.
  It must be defined as a class attribute, because the settings are loaded before the class is instantiated.

  #5 settings
  self.settings['<setting name>'] gives access to the settings from settings.py; values you defined in custom_settings still take precedence.

  #6 logger
  A logger, named after the spider by default:
  self.logger.debug('=============>%s' % self.settings['BOT_NAME'])

  #7 crawler: for reference
  This attribute is set by the from_crawler class method.

  #8 from_crawler(crawler, *args, **kwargs): for reference
  You probably won't need to override this directly because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

  #9 start_requests()
  Sends the first Requests, and must return an iterable. Scrapy calls it exactly once, when the spider is opened.
  By default it turns each url in start_urls into Request(url, dont_filter=True)

  # for the dont_filter argument, see the custom dedup rules below

  If you want to change the initial requests, override this method; for example, to start with a POST request:
  class MySpider(scrapy.Spider):
      name = 'myspider'

      def start_requests(self):
          return [scrapy.FormRequest("http://www.example.com/login",
                                     formdata={'user': 'john', 'pass': 'secret'},
                                     callback=self.logged_in)]

      def logged_in(self, response):
          # here you would extract links to follow and return Requests for
          # each of them, with another callback
          pass

  #10 parse(response)
  The default callback. Every callback must return an iterable of Request and/or dicts or Item objects.

  #11 log(message[, level, component]): for reference
  Wrapper that sends a log message through the Spider's logger, kept for backwards compatibility. For more information see Logging from Spiders.

  #12 closed(reason)
  Triggered automatically when the spider closes.

scrapy.Spider attributes and methods in detail

  Dedup rules should be shared across spiders: once any spider has crawled a URL, no other spider should crawl it again. Possible implementations:

  # Method 1:
  1. add a class attribute
  visited = set()   # class attribute

  2. in the parse callback:
  def parse(self, response):
      if response.url in self.visited:
          return None
      .......
      self.visited.add(response.url)

  # Method 1, improved: URLs can be long, so store a hash of the URL instead
  def parse(self, response):
      url = md5(response.request.url)
      if url in self.visited:
          return None
      .......
      self.visited.add(url)

  # Method 2: Scrapy's built-in dedup
  in the settings file:
  DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'   # the default dedup rule; fingerprints are kept in memory
  DUPEFILTER_DEBUG = False
  JOBDIR = "path where the seen-requests log is stored, e.g. /root/"   # the final path is /root/requests.seen; fingerprints are kept in a file

  Scrapy's default dedup class is RFPDupeFilter; you only need to pass
  Request(..., dont_filter=False). dont_filter=True tells Scrapy to exclude this URL from dedup.

  # Method 3:
  You can also write your own dedup rule modeled on RFPDupeFilter:

  from scrapy.dupefilter import RFPDupeFilter — read the source, and model your class on BaseDupeFilter

  # Step 1: create a dedup file dup.py in the project directory
  class UrlFilter(object):
      def __init__(self):
          self.visited = set()   # or store them in a database

      @classmethod
      def from_settings(cls, settings):
          return cls()

      def request_seen(self, request):
          if request.url in self.visited:
              return True
          self.visited.add(request.url)

      def open(self):   # can return a deferred
          pass

      def close(self, reason):   # can return a deferred
          pass

      def log(self, request, spider):   # log that a request has been filtered
          pass

  # Step 2: in settings.py:
  DUPEFILTER_CLASS = '<project name>.dup.UrlFilter'

  # Source walkthrough:
  from scrapy.core.scheduler import Scheduler
  the Scheduler's enqueue_request method calls: self.df.request_seen(request)

Dedup rules: removing duplicate URLs
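The "store a hash instead of the full URL" idea from Method 1 can be sketched with only the standard library; the `md5` helper in the text is assumed to be hashlib-based, and the `request_seen` name mirrors the dupefilter API shown above:

```python
import hashlib

def url_fingerprint(url):
    # hash the URL so the visited set stores fixed-size digests
    # instead of arbitrarily long strings
    return hashlib.md5(url.encode('utf-8')).hexdigest()

class VisitedFilter:
    """Toy shared dedup filter: request_seen returns True for a URL
    that was already recorded, False the first time it is seen."""
    def __init__(self):
        self.visited = set()

    def request_seen(self, url):
        fp = url_fingerprint(url)
        if fp in self.visited:
            return True
        self.visited.add(fp)
        return False
```

Scrapy's real RFPDupeFilter fingerprints the whole request (method, URL, body), not just the URL string, so this is only the idea, not the real algorithm.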

  # Example 1:
  import scrapy

  class MySpider(scrapy.Spider):
      name = 'example.com'
      allowed_domains = ['example.com']
      start_urls = [
          'http://www.example.com/1.html',
          'http://www.example.com/2.html',
          'http://www.example.com/3.html',
      ]

      def parse(self, response):
          self.logger.info('A response from %s just arrived!', response.url)

  # Example 2: a single callback returning multiple Requests and Items
  import scrapy

  class MySpider(scrapy.Spider):
      name = 'example.com'
      allowed_domains = ['example.com']
      start_urls = [
          'http://www.example.com/1.html',
          'http://www.example.com/2.html',
          'http://www.example.com/3.html',
      ]

      def parse(self, response):
          for h3 in response.xpath('//h3').extract():
              yield {"title": h3}

          for url in response.xpath('//a/@href').extract():
              yield scrapy.Request(url, callback=self.parse)

  # Example 3: specify the initial URLs directly in start_requests(); start_urls is then unused

  import scrapy
  from myproject.items import MyItem

  class MySpider(scrapy.Spider):
      name = 'example.com'
      allowed_domains = ['example.com']

      def start_requests(self):
          yield scrapy.Request('http://www.example.com/1.html', self.parse)
          yield scrapy.Request('http://www.example.com/2.html', self.parse)
          yield scrapy.Request('http://www.example.com/3.html', self.parse)

      def parse(self, response):
          for h3 in response.xpath('//h3').extract():
              yield MyItem(title=h3)

          for url in response.xpath('//a/@href').extract():
              yield scrapy.Request(url, callback=self.parse)

Examples

  You may need to pass arguments to a spider from the command line, e.g. an initial URL:
  # on the command line
  scrapy crawl myspider -a category=electronics

  # receive the external argument in __init__
  import scrapy

  class MySpider(scrapy.Spider):
      name = 'myspider'

      def __init__(self, category=None, *args, **kwargs):
          super(MySpider, self).__init__(*args, **kwargs)
          self.start_urls = ['http://www.example.com/categories/%s' % category]
          #...

  # note: all received arguments are strings; if you want structured data, parse them with something like json.loads

Passing arguments
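Since every -a argument arrives as a string, structured values have to be decoded by hand, as the note above says. A minimal sketch (the urls argument name is made up for illustration):

```python
import json

def parse_spider_arg(raw):
    # scrapy crawl myspider -a urls='["http://a.com", "http://b.com"]'
    # delivers one string; json.loads turns it back into a Python list
    return json.loads(raw)
```

Inside a spider's `__init__` you would call this on the keyword argument before building `start_urls`.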

6. Other generic Spiders: https://docs.scrapy.org/en/latest/topics/spiders.html#generic-spiders

6 Selectors

  #1  // vs /
  #2  text
  #3  extract and extract_first: pull the content out of a Selector object
  #4  attributes: prefix the attribute name with @ in xpath
  #5  nested lookups
  #6  default values
  #7  lookup by attribute
  #8  fuzzy lookup by attribute
  #9  regular expressions
  #10 relative xpath
  #11 xpath with variables
  response.selector.css()
  response.selector.xpath()
  can be shortened to
  response.css()
  response.xpath()

  #1 // vs /
  response.xpath('//body/a')
  response.css('div a::text')

  >>> response.xpath('//body/a')   # the leading // searches the whole document; the / after body matches direct children of body
  []
  >>> response.xpath('//body//a')  # the leading // searches the whole document; the // after body matches all descendants of body
  [<Selector xpath='//body//a' data='<a href="image1.html">Name: My image 1 <'>, <Selector xpath='//body//a' data='<a href="image2.html">Name: My image 2 <'>, <Selector xpath='//body//a' data='<a href="image3.html">Name: My image 3 <'>, <Selector xpath='//body//a' data='<a href="image4.html">Name: My image 4 <'>, <Selector xpath='//body//a' data='<a href="image5.html">Name: My image 5 <'>]

  #2 text
  >>> response.xpath('//body//a/text()')
  >>> response.css('body a::text')

  #3 extract and extract_first: pull the content out of a Selector object
  >>> response.xpath('//div/a/text()').extract()
  ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']
  >>> response.css('div a::text').extract()
  ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

  >>> response.xpath('//div/a/text()').extract_first()
  'Name: My image 1 '
  >>> response.css('div a::text').extract_first()
  'Name: My image 1 '

  #4 attributes: prefix the attribute name with @ in xpath
  >>> response.xpath('//div/a/@href').extract_first()
  'image1.html'
  >>> response.css('div a::attr(href)').extract_first()
  'image1.html'

  #5 nested lookups
  >>> response.xpath('//div').css('a').xpath('@href').extract_first()
  'image1.html'

  #6 default values
  >>> response.xpath('//div[@id="xxx"]').extract_first(default="not found")
  'not found'

  #7 lookup by attribute
  response.xpath('//div[@id="images"]/a[@href="image3.html"]/text()').extract()
  response.css('#images a[href="image3.html"]::text').extract()

  #8 fuzzy lookup by attribute
  response.xpath('//a[contains(@href,"image")]/@href').extract()
  response.css('a[href*="image"]::attr(href)').extract()

  response.xpath('//a[contains(@href,"image")]/img/@src').extract()
  response.css('a[href*="imag"] img::attr(src)').extract()

  response.xpath('//*[@href="image1.html"]')
  response.css('*[href="image1.html"]')

  #9 regular expressions
  response.xpath('//a/text()').re(r'Name: (.*)')
  response.xpath('//a/text()').re_first(r'Name: (.*)')

  #10 relative xpath
  >>> res=response.xpath('//a[contains(@href,"3")]')[0]
  >>> res.xpath('img')
  [<Selector xpath='img' data='<img src="data:image3_thumb.jpg">'>]
  >>> res.xpath('./img')
  [<Selector xpath='./img' data='<img src="data:image3_thumb.jpg">'>]
  >>> res.xpath('.//img')
  [<Selector xpath='.//img' data='<img src="data:image3_thumb.jpg">'>]
  >>> res.xpath('//img')   # this scans from the top of the document again
  [<Selector xpath='//img' data='<img src="data:image1_thumb.jpg">'>, <Selector xpath='//img' data='<img src="data:image2_thumb.jpg">'>, <Selector xpath='//img' data='<img src="data:image3_thumb.jpg">'>, <Selector xpath='//img' data='<img src="data:image4_thumb.jpg">'>, <Selector xpath='//img' data='<img src="data:image5_thumb.jpg">'>]

  #11 xpath with variables
  >>> response.xpath('//div[@id=$xxx]/a/text()', xxx='images').extract_first()
  'Name: My image 1 '
  >>> response.xpath('//div[count(a)=$yyy]/@id', yyy=5).extract_first()   # the div containing 5 a tags
  'images'

https://docs.scrapy.org/en/latest/topics/selectors.html
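The absolute-vs-relative distinction in #10 can be reproduced outside Scrapy with the standard library's xml.etree.ElementTree, a rough stand-in for Scrapy's Selectors (ElementTree supports only a subset of XPath, and the sample HTML below is made up):

```python
import xml.etree.ElementTree as ET

html = """
<body>
  <div id="images">
    <a href="image1.html">Name: My image 1 <img src="image1_thumb.jpg"/></a>
    <a href="image2.html">Name: My image 2 <img src="image2_thumb.jpg"/></a>
  </div>
</body>
"""

root = ET.fromstring(html)
# './/a' searches all descendants, like '//body//a' in Scrapy
links = root.findall('.//a')
# a relative lookup starting from one node, like res.xpath('.//img')
first_img = links[0].find('.//img')
```

Note that ElementTree always anchors `.//` at the current node, whereas in Scrapy a bare `//img` on a sub-selector rescans the whole document, which is the trap #10 warns about.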

7 Items

https://docs.scrapy.org/en/latest/topics/items.html

8 Item Pipeline

  # Part 1: you can write multiple Pipeline classes
  #1 If a higher-priority pipeline's process_item returns a value (or None), it is automatically passed on to the next pipeline's process_item.
  #2 If you only want the first pipeline to run, have its process_item raise DropItem().

  #3 You can use spider.name == '<spider name>' to control which spiders use which pipelines.

  # Part 2: example
  from scrapy.exceptions import DropItem

  class CustomPipeline(object):
      def __init__(self, v):
          self.value = v

      @classmethod
      def from_crawler(cls, crawler):
          """
          Scrapy first checks with getattr whether we defined from_crawler;
          if so, it is called to build the instance
          """
          val = crawler.settings.getint('MMMM')
          return cls(val)

      def open_spider(self, spider):
          """
          Runs once, when the spider starts
          """
          print('')

      def close_spider(self, spider):
          """
          Runs once, when the spider closes
          """
          print('')

      def process_item(self, item, spider):
          # process the item and persist it

          # return means later pipelines continue to process the item
          return item

          # to drop the item so no later pipeline sees it:
          # raise DropItem()

Custom pipeline
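The priority chain described above (each process_item's return value feeds the next pipeline; DropItem stops the chain) can be simulated with plain Python. DropItem here is a local stand-in for scrapy.exceptions.DropItem, and the two sample pipelines are made up:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""

def run_pipelines(item, pipelines):
    # pipelines are ordered by priority: each return value feeds the next
    # one; raising DropItem aborts the chain, so later pipelines never run
    for pipe in pipelines:
        try:
            item = pipe(item)
        except DropItem:
            return None
    return item

def drop_unnamed(item):
    # drop items without a name, like a validation pipeline would
    if not item.get('name'):
        raise DropItem()
    return item

def add_currency(item):
    # enrich the item; runs only if the item survived earlier pipelines
    item['currency'] = 'CNY'
    return item
```

In real Scrapy the ordering comes from the integers in ITEM_PIPELINES (lower runs first); here it is just the list order.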

  #1 settings.py
  HOST = "127.0.0.1"
  PORT = 27017
  USER = "root"
  PWD = ""
  DB = "amazon"
  TABLE = "goods"

  ITEM_PIPELINES = {
      'Amazon.pipelines.CustomPipeline': 200,
  }

  #2 pipelines.py
  from pymongo import MongoClient

  class CustomPipeline(object):
      def __init__(self, host, port, user, pwd, db, table):
          self.host = host
          self.port = port
          self.user = user
          self.pwd = pwd
          self.db = db
          self.table = table

      @classmethod
      def from_crawler(cls, crawler):
          """
          Scrapy first checks with getattr whether we defined from_crawler;
          if so, it is called to build the instance
          """
          HOST = crawler.settings.get('HOST')
          PORT = crawler.settings.get('PORT')
          USER = crawler.settings.get('USER')
          PWD = crawler.settings.get('PWD')
          DB = crawler.settings.get('DB')
          TABLE = crawler.settings.get('TABLE')
          return cls(HOST, PORT, USER, PWD, DB, TABLE)

      def open_spider(self, spider):
          """
          Runs once, when the spider starts
          """
          self.client = MongoClient('mongodb://%s:%s@%s:%s' % (self.user, self.pwd, self.host, self.port))

      def close_spider(self, spider):
          """
          Runs once, when the spider closes
          """
          self.client.close()

      def process_item(self, item, spider):
          # process the item and persist it

          self.client[self.db][self.table].save(dict(item))

Example

https://docs.scrapy.org/en/latest/topics/item-pipeline.html

9 Downloader Middleware

  class DownMiddleware1(object):
      def process_request(self, request, spider):
          """
          Called, via every downloader middleware's process_request, when a request needs to be downloaded
          :param request:
          :param spider:
          :return:
              None: continue through the remaining middlewares and download
              Response object: stop the process_request calls and start running process_response
              Request object: stop the middleware chain and hand the Request back to the scheduler
              raise IgnoreRequest: stop the process_request calls and start running process_exception
          """
          pass

      def process_response(self, request, response, spider):
          """
          Called on the way back, once the download has completed
          :param request:
          :param response:
          :param spider:
          :return:
              Response object: handed to the next middleware's process_response
              Request object: stop the middleware chain; the request is rescheduled for download
              raise IgnoreRequest: Request.errback is called
          """
          print('response1')
          return response

      def process_exception(self, request, exception, spider):
          """
          Called when the download handler or a process_request (downloader middleware) raises an exception
          :param request:
          :param exception:
          :param spider:
          :return:
              None: pass the exception on to the next middleware
              Response object: stop running further process_exception methods
              Request object: stop the middleware chain; the request will be rescheduled for download
          """
          return None

Downloader middleware

https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
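The return-value contract above can be sketched as a toy chain: a process_request hook that returns None lets the next middleware run, while one that returns a Response short-circuits the real download. Request and Response here are minimal stand-ins, not Scrapy's classes, and cache_mw is a made-up example middleware:

```python
class Request:
    def __init__(self, url):
        self.url = url

class Response:
    def __init__(self, url, body):
        self.url = url
        self.body = body

def download(request, middlewares):
    # walk the process_request hooks: the first one that returns a
    # Response short-circuits the real download, mirroring the contract
    for mw in middlewares:
        result = mw(request)
        if isinstance(result, Response):
            return result
    return Response(request.url, '<real download>')

def cache_mw(request):
    # pretend this URL is cached: answer without downloading
    if request.url == 'http://cached.example':
        return Response(request.url, '<from cache>')
    return None
```

Scrapy's HttpCacheMiddleware uses exactly this mechanism (point 4 in the components list: "pass a response to a spider without fetching a web page").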

  import requests

  class DownMiddleware1(object):
      @staticmethod
      def get_proxy():
          return requests.get("http://127.0.0.1:5010/get/").text

      @staticmethod
      def delete_proxy(proxy):
          requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

      def process_request(self, request, spider):
          """
          Called, via every downloader middleware's process_request, when a request needs to be downloaded
          :param request:
          :param spider:
          :return:
              None: continue through the remaining middlewares and download
              Response object: stop the process_request calls and start running process_response
              Request object: stop the middleware chain and hand the Request back to the scheduler
              raise IgnoreRequest: stop the process_request calls and start running process_exception
          """

          if not hasattr(DownMiddleware1, 'proxy_addr'):
              DownMiddleware1.proxy_addr = self.get_proxy()
          request.meta['download_timeout'] = 5
          request.meta["proxy"] = "http://" + self.proxy_addr

          print('request meta', request.meta)

          if request.meta.get('depth') == 10 or request.meta.get('retry_times') == 2:
              request.meta['depth'] = 0
              request.meta['retry_times'] = 0
              self.delete_proxy(self.proxy_addr)

              DownMiddleware1.proxy_addr = self.get_proxy()
              request.meta["proxy"] = "http://" + self.proxy_addr
              print('============>', request.meta)
              return request

          return None

10 Spider Middleware

  class SpiderMiddleware(object):

      def process_spider_input(self, response, spider):
          """
          Called after the download finishes, before the response is handed to parse
          :param response:
          :param spider:
          :return:
          """
          pass

      def process_spider_output(self, response, result, spider):
          """
          Called with the spider's results, after the spider has processed the response
          :param response:
          :param result:
          :param spider:
          :return: must return an iterable containing Request and/or Item objects
          """
          return result

      def process_spider_exception(self, response, exception, spider):
          """
          Called on exceptions
          :param response:
          :param exception:
          :param spider:
          :return: None to pass the exception on to the next middleware; or an iterable containing Response/Item objects, handed to the scheduler or pipelines
          """
          return None

      def process_start_requests(self, start_requests, spider):
          """
          Called when the spider starts
          :param start_requests:
          :param spider:
          :return: an iterable containing Request objects
          """
          return start_requests

Spider middleware

https://docs.scrapy.org/en/latest/topics/spider-middleware.html

11 settings.py

  1. # -*- coding: utf-8 -*-
  2.  
  3. # Scrapy settings for step8_king project
  4. #
  5. # For simplicity, this file contains only settings considered important or
  6. # commonly used. You can find more settings consulting the documentation:
  7. #
  8. # http://doc.scrapy.org/en/latest/topics/settings.html
  9. # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
  10. # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
  11.  
  12. # 1. 爬虫名称
  13. BOT_NAME = 'step8_king'
  14.  
  15. # 2. 爬虫应用路径
  16. SPIDER_MODULES = ['step8_king.spiders']
  17. NEWSPIDER_MODULE = 'step8_king.spiders'
  18.  
  19. # Crawl responsibly by identifying yourself (and your website) on the user-agent
  20. # 3. 客户端 user-agent请求头
  21. # USER_AGENT = 'step8_king (+http://www.yourdomain.com)'
  22.  
  23. # Obey robots.txt rules
  24. # 4. 禁止爬虫配置
  25. # ROBOTSTXT_OBEY = False
  26.  
  27. # Configure maximum concurrent requests performed by Scrapy (default: 16)
  28. # 5. 并发请求数
  29. # CONCURRENT_REQUESTS = 4
  30.  
  31. # Configure a delay for requests for the same website (default: 0)
  32. # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
  33. # See also autothrottle settings and docs
  34. # 6. 延迟下载秒数
  35. # DOWNLOAD_DELAY = 2
  36.  
  37. # The download delay setting will honor only one of:
  38. # 7. 单域名访问并发数,并且延迟下次秒数也应用在每个域名
  39. # CONCURRENT_REQUESTS_PER_DOMAIN = 2
  40. # 单IP访问并发数,如果有值则忽略:CONCURRENT_REQUESTS_PER_DOMAIN,并且延迟下次秒数也应用在每个IP
  41. # CONCURRENT_REQUESTS_PER_IP = 3
  42.  
  43. # Disable cookies (enabled by default)
  44. # 8. 是否支持cookie,cookiejar进行操作cookie
  45. # COOKIES_ENABLED = True
  46. # COOKIES_DEBUG = True
  47.  
  48. # Disable Telnet Console (enabled by default)
  49. # 9. Telnet用于查看当前爬虫的信息,操作爬虫等...
  50. # 使用telnet ip port ,然后通过命令操作
  51. # TELNETCONSOLE_ENABLED = True
  52. # TELNETCONSOLE_HOST = '127.0.0.1'
  53. # TELNETCONSOLE_PORT = [6023,]
  54.  
  55. # 10. 默认请求头
  56. # Override the default request headers:
  57. # DEFAULT_REQUEST_HEADERS = {
  58. # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  59. # 'Accept-Language': 'en',
  60. # }
  61.  
  62. # Configure item pipelines
  63. # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
  64. # 11. 定义pipeline处理请求
  65. # ITEM_PIPELINES = {
  66. # 'step8_king.pipelines.JsonPipeline': 700,
  67. # 'step8_king.pipelines.FilePipeline': 500,
  68. # }
  69.  
  70. # 12. 自定义扩展,基于信号进行调用
  71. # Enable or disable extensions
  72. # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
  73. # EXTENSIONS = {
  74. # # 'step8_king.extensions.MyExtension': 500,
  75. # }
  76.  
  77. # 13. 爬虫允许的最大深度,可以通过meta查看当前深度;0表示无深度
  78. # DEPTH_LIMIT = 3
  79.  
  80. # 14. 爬取时,0表示深度优先Lifo(默认);1表示广度优先FiFo
  81.  
  82. # 后进先出,深度优先
  83. # DEPTH_PRIORITY = 0
  84. # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
  85. # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
  86. # 先进先出,广度优先
  87.  
  88. # DEPTH_PRIORITY = 1
  89. # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
  90. # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'
  91.  
  92. # 15. 调度器队列
  93. # SCHEDULER = 'scrapy.core.scheduler.Scheduler'
  94. # from scrapy.core.scheduler import Scheduler
  95.  
  96. # 16. 访问URL去重
  97. # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
  98.  
  99. # Enable and configure the AutoThrottle extension (disabled by default)
  100. # See http://doc.scrapy.org/en/latest/topics/autothrottle.html
  101.  
  102. """
  103. 17. 自动限速算法
  104. from scrapy.contrib.throttle import AutoThrottle
  105. 自动限速设置
  106. 1. 获取最小延迟 DOWNLOAD_DELAY
  107. 2. 获取最大延迟 AUTOTHROTTLE_MAX_DELAY
  108. 3. 设置初始下载延迟 AUTOTHROTTLE_START_DELAY
  109. 4. 当请求下载完成后,获取其"连接"时间 latency,即:请求连接到接受到响应头之间的时间
  110. 5. 用于计算的... AUTOTHROTTLE_TARGET_CONCURRENCY
  111. target_delay = latency / self.target_concurrency
  112. new_delay = (slot.delay + target_delay) / 2.0 # 表示上一次的延迟时间
  113. new_delay = max(target_delay, new_delay)
  114. new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
  115. slot.delay = new_delay
  116. """
  117.  
  118. # 开始自动限速
  119. # AUTOTHROTTLE_ENABLED = True
  120. # The initial download delay
  121. # 初始下载延迟
  122. # AUTOTHROTTLE_START_DELAY = 5
  123. # The maximum download delay to be set in case of high latencies
  124. # 最大下载延迟
  125. # AUTOTHROTTLE_MAX_DELAY = 10
  126. # The average number of requests Scrapy should be sending in parallel to each remote server
  127. # 平均每秒并发数
  128. # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
  129.  
  130. # Enable showing throttling stats for every response received:
  131. # 是否显示
  132. # AUTOTHROTTLE_DEBUG = True
  133.  
  134. # Enable and configure HTTP caching (disabled by default)
  135. # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
  136.  
  137. """
  138. 18. 启用缓存
  139. 目的用于将已经发送的请求或相应缓存下来,以便以后使用
  140.  
  141. from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
  142. from scrapy.extensions.httpcache import DummyPolicy
  143. from scrapy.extensions.httpcache import FilesystemCacheStorage
  144. """
  145. # 是否启用缓存策略
  146. # HTTPCACHE_ENABLED = True
  147.  
  148. # 缓存策略:所有请求均缓存,下次在请求直接访问原来的缓存即可
  149. # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
  150. # 缓存策略:根据Http响应头:Cache-Control、Last-Modified 等进行缓存的策略
  151. # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
  152.  
  153. # 缓存超时时间
  154. # HTTPCACHE_EXPIRATION_SECS = 0
  155.  
  156. # 缓存保存路径
  157. # HTTPCACHE_DIR = 'httpcache'
  158.  
  159. # 缓存忽略的Http状态码
  160. # HTTPCACHE_IGNORE_HTTP_CODES = []
  161.  
  162. # 缓存存储的插件
  163. # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
  164.  
  165. """
  166. 19. 代理,需要在环境变量中设置
  167. from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware
  168.  
  169. 方式一:使用默认
  170. os.environ
  171. {
  172. http_proxy:http://root:woshiniba@192.168.11.11:9999/
  173. https_proxy:http://192.168.11.11:9999/
  174. }
  175. 方式二:使用自定义下载中间件
  176.  
  177. def to_bytes(text, encoding=None, errors='strict'):
  178. if isinstance(text, bytes):
  179. return text
  180. if not isinstance(text, six.string_types):
  181. raise TypeError('to_bytes must receive a unicode, str or bytes '
  182. 'object, got %s' % type(text).__name__)
  183. if encoding is None:
  184. encoding = 'utf-8'
  185. return text.encode(encoding, errors)
  186.  
  187. class ProxyMiddleware(object):
  188. def process_request(self, request, spider):
  189. PROXIES = [
  190. {'ip_port': '111.11.228.75:80', 'user_pass': ''},
  191. {'ip_port': '120.198.243.22:80', 'user_pass': ''},
  192. {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
  193. {'ip_port': '101.71.27.120:80', 'user_pass': ''},
  194. {'ip_port': '122.96.59.104:80', 'user_pass': ''},
  195. {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
  196. ]
  197. proxy = random.choice(PROXIES)
  198. if proxy['user_pass'] is not None:
  199. request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
  200. encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
  201. request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
  202. print "**************ProxyMiddleware have pass************" + proxy['ip_port']
  203. else:
  204. print "**************ProxyMiddleware no pass************" + proxy['ip_port']
  205. request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
  206.  
  207. DOWNLOADER_MIDDLEWARES = {
  208. 'step8_king.middlewares.ProxyMiddleware': 500,
  209. }
  210.  
  211. """
  212.  
"""
20. HTTPS access
    There are two cases when downloading over HTTPS:
    1. The target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. The target site uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # a PKey object
                    certificate=v2,  # an X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other:
        Related classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        Related settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

"""

"""
21. Spider middleware
    class SpiderMiddleware(object):

        def process_spider_input(self, response, spider):
            '''
            Called once the download finishes, before the response is handed to parse()
            :param response:
            :param spider:
            :return:
            '''
            pass

        def process_spider_output(self, response, result, spider):
            '''
            Called with what the spider returns, on its way back to the engine
            :param response:
            :param result:
            :param spider:
            :return: must be an iterable containing Request and/or Item objects
            '''
            return result

        def process_spider_exception(self, response, exception, spider):
            '''
            Called when an exception is raised
            :param response:
            :param exception:
            :param spider:
            :return: None to pass the exception on to later middlewares; or an iterable
                     containing Response/Item objects, handed to the scheduler or pipeline
            '''
            return None

        def process_start_requests(self, start_requests, spider):
            '''
            Called once when the spider starts
            :param start_requests:
            :param spider:
            :return: an iterable containing Request objects
            '''
            return start_requests

    Built-in spider middlewares:
        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
        'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
        'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
        'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

"""
# from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': 543,
}

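The priority numbers in `SPIDER_MIDDLEWARES` decide the order in which the hooks fire: `process_spider_output` runs from the middleware closest to the spider (highest number) toward the engine. A minimal pure-Python sketch of that chaining, with made-up middleware names and no Scrapy dependency:

```python
# Two toy spider middlewares: one tags every scraped item, one filters
# items out. Output hooks run in decreasing priority order (spider side
# first), each consuming what the previous one returned.
class DropShortTitles:
    def process_spider_output(self, result):
        return [item for item in result if len(item['title']) >= 3]

class TagSource:
    def process_spider_output(self, result):
        return [{**item, 'source': 'demo'} for item in result]

middlewares = {DropShortTitles(): 543, TagSource(): 800}
# highest priority number = closest to the spider = runs first on output
chain = [mw for mw, _ in sorted(middlewares.items(), key=lambda kv: -kv[1])]

items = [{'title': 'ok'}, {'title': 'long enough'}]
for mw in chain:
    items = mw.process_spider_output(items)
print(items)   # only the long title survives, now tagged
```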
"""
22. Downloader middleware
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            '''
            Called, via every downloader middleware's process_request, when a request needs downloading
            :param request:
            :param spider:
            :return:
                None: continue to the later middlewares and then the downloader
                Response object: stop calling process_request and start calling process_response
                Request object: stop the middleware chain and hand the Request back to the scheduler
                raise IgnoreRequest: stop calling process_request and start calling process_exception
            '''
            pass

        def process_response(self, request, response, spider):
            '''
            Called with the downloaded response, on its way back to the engine
            :param request:
            :param response:
            :param spider:
            :return:
                Response object: passed on to the remaining middlewares' process_response
                Request object: stop the chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            '''
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            '''
            Called when the download handler or a process_request() (downloader middleware) raises an exception
            :param request:
            :param exception:
            :param spider:
            :return:
                None: pass the exception on to later middlewares
                Response object: stop calling the remaining process_exception methods
                Request object: stop the chain; the request is rescheduled for download
            '''
            return None

    Default downloader middlewares:
    {
        'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
        'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
        'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
        'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
        'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
        'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
        'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
        'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
        'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
        'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
    }

"""
# from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'step8_king.middlewares.DownMiddleware1': 100,
#    'step8_king.middlewares.DownMiddleware2': 500,
# }
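The key contract above is the short-circuit: as soon as any `process_request` returns a non-None value, the remaining middlewares and the downloader are skipped. A pure-Python sketch of that control flow (class and function names here are illustrative, not Scrapy internals; requests/responses are plain strings for brevity):

```python
# Each middleware may return None (keep going) or a "response"
# (short-circuit: later middlewares and the downloader never run).
class Logger:
    seen = []
    def process_request(self, request):
        self.seen.append(request)
        return None            # None -> continue down the chain

class Cached:
    def __init__(self, cache):
        self.cache = cache
    def process_request(self, request):
        return self.cache.get(request)   # a cache hit short-circuits

def download(request, middlewares):
    for mw in middlewares:
        response = mw.process_request(request)
        if response is not None:
            return response    # skip the rest of the chain
    return 'downloaded:' + request       # fell through to the downloader

mws = [Logger(), Cached({'/cached': 'cached-body'})]
print(download('/fresh', mws))
print(download('/cached', mws))
```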

settings.py

十二 Crawling Amazon product information

1. Create the project and spider
    scrapy startproject Amazon
    cd Amazon
    scrapy genspider spider_goods www.amazon.cn

2. settings.py
    ROBOTSTXT_OBEY = False
    # request headers
    DEFAULT_REQUEST_HEADERS = {
        'Referer': 'https://www.amazon.cn/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36'
    }
    # uncomment the HTTP cache settings
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
    HTTPCACHE_DIR = 'httpcache'
    HTTPCACHE_IGNORE_HTTP_CODES = []
    HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. items.py
    class GoodsItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        goods_name = scrapy.Field()       # product name
        goods_price = scrapy.Field()      # price
        delivery_method = scrapy.Field()  # delivery method

4. spider_goods.py
    # -*- coding: utf-8 -*-
    import scrapy

    from Amazon.items import GoodsItem
    from scrapy.http import Request
    from urllib.parse import urlencode

    class SpiderGoodsSpider(scrapy.Spider):
        name = 'spider_goods'
        allowed_domains = ['www.amazon.cn']
        # start_urls = ['http://www.amazon.cn/']

        def __init__(self, keyword=None, *args, **kwargs):
            super(SpiderGoodsSpider, self).__init__(*args, **kwargs)
            self.keyword = keyword

        def start_requests(self):
            url = 'https://www.amazon.cn/s/ref=nb_sb_noss_1?'
            params = {
                '__mk_zh_CN': '亚马逊网站',
                'url': 'search-alias=aps',
                'field-keywords': self.keyword
            }
            url = url + urlencode(params, encoding='utf-8')
            yield Request(url, callback=self.parse_index)

        def parse_index(self, response):
            print('parsing index page: %s' % response.url)

            urls = response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
            for url in urls:
                yield Request(url, callback=self.parse_detail)

            next_href = response.xpath('//*[@id="pagnNextLink"]/@href').extract_first()
            if next_href:  # the last page has no "next" link
                next_url = response.urljoin(next_href)
                print('next page url:', next_url)
                yield Request(next_url, callback=self.parse_index)

        def parse_detail(self, response):
            print('parsing detail page: %s' % response.url)

            item = GoodsItem()
            # product name
            item['goods_name'] = (response.xpath('//*[@id="productTitle"]/text()').extract_first() or '').strip()
            # price
            item['goods_price'] = (response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first() or '').strip()
            # delivery method
            item['delivery_method'] = ''.join(response.xpath('//*[@id="ddmMerchantMessage"]//text()').extract())
            return item

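The search URL that `start_requests` assembles can be reproduced standalone; `urlencode` percent-encodes both the Chinese parameter value and the `=` inside `search-alias=aps` (the `iphone8` keyword here stands in for what the spider receives via `-a keyword=...`):

```python
from urllib.parse import urlencode

base = 'https://www.amazon.cn/s/ref=nb_sb_noss_1?'
params = {
    '__mk_zh_CN': '亚马逊网站',        # percent-encoded as UTF-8
    'url': 'search-alias=aps',        # the inner '=' becomes %3D
    'field-keywords': 'iphone8',
}
url = base + urlencode(params, encoding='utf-8')
print(url)
```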
5. Custom pipelines
    # sql.py
    import pymysql
    from Amazon import settings

    MYSQL_HOST = settings.MYSQL_HOST
    MYSQL_PORT = settings.MYSQL_PORT
    MYSQL_USER = settings.MYSQL_USER
    MYSQL_PWD = settings.MYSQL_PWD
    MYSQL_DB = settings.MYSQL_DB

    conn = pymysql.connect(
        host=MYSQL_HOST,
        port=int(MYSQL_PORT),
        user=MYSQL_USER,
        password=MYSQL_PWD,
        db=MYSQL_DB,
        charset='utf8'
    )
    cursor = conn.cursor()

    class Mysql(object):
        @staticmethod
        def insert_tables_goods(goods_name, goods_price, deliver_mode):
            sql = 'insert into goods(goods_name,goods_price,delivery_method) values(%s,%s,%s)'
            cursor.execute(sql, args=(goods_name, goods_price, deliver_mode))
            conn.commit()

        @staticmethod
        def is_repeat(goods_name):
            sql = 'select count(1) from goods where goods_name=%s'
            cursor.execute(sql, args=(goods_name,))
            if cursor.fetchone()[0] >= 1:
                return True

    if __name__ == '__main__':
        cursor.execute('select * from goods;')
        print(cursor.fetchall())

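The insert/dedupe pattern in `sql.py` can be exercised without a MySQL server using the stdlib `sqlite3` module (same idea, but the parameter placeholder is `?` instead of pymysql's `%s`; the helper names mirror the ones above but are reimplemented here for the sketch):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table goods(goods_name text, goods_price text, delivery_method text)')

def insert_goods(name, price, method):
    # parameterized query: values are bound, never string-formatted into SQL
    cursor.execute('insert into goods values (?,?,?)', (name, price, method))
    conn.commit()

def is_repeat(name):
    cursor.execute('select count(1) from goods where goods_name=?', (name,))
    return cursor.fetchone()[0] >= 1

if not is_repeat('iphone8'):
    insert_goods('iphone8', '5888', 'Amazon delivery')
print(is_repeat('iphone8'))   # True
```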
    # pipelines.py
    from Amazon.mysqlpipelines.sql import Mysql

    class AmazonPipeline(object):
        def process_item(self, item, spider):
            goods_name = item['goods_name']
            goods_price = item['goods_price']
            delivery_mode = item['delivery_method']
            if not Mysql.is_repeat(goods_name):
                Mysql.insert_tables_goods(goods_name, goods_price, delivery_mode)

6. Create the database and table
    create database amazon charset utf8;
    create table goods(
        id int primary key auto_increment,
        goods_name char(30),
        goods_price char(20),
        delivery_method varchar(50)
    );

7. settings.py
    MYSQL_HOST = 'localhost'
    MYSQL_PORT = '3306'   # default MySQL port
    MYSQL_USER = 'root'
    MYSQL_PWD = ''
    MYSQL_DB = 'amazon'

    # the number is the priority (pick anything in 1-1000; lower numbers run first)
    ITEM_PIPELINES = {
        'Amazon.mysqlpipelines.pipelines.AmazonPipeline': 1,
    }

8. Create entrypoint.py in the project root
    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'spider_goods', '-a', 'keyword=iphone8'])

https://pan.baidu.com/s/1boCEBT1
