一、Introduction to Scrapy

Scrapy is built on Twisted, an event-driven networking framework.

Because of this, Scrapy is implemented in a non-blocking (i.e. asynchronous) way for the sake of concurrency.
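In practice this means a spider never blocks waiting for a response: it yields Request objects, and the engine invokes the given callback whenever the corresponding response arrives. A minimal sketch of that callback model (the spider name and URLs below are placeholders, not part of this project):

import scrapy

class AsyncDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the non-blocking callback flow
    name = 'async_demo'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Yielding a Request does not block; the engine schedules it and
        # calls parse_detail once the response is available.
        yield scrapy.Request('http://example.com/detail', callback=self.parse_detail)

    def parse_detail(self, response):
        self.logger.info('got %s', response.url)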

Reference: 武Sir's notes

Reference: Scrapy 0.25 documentation

Reference: Scrapy architecture overview

二、Example: crawling news from chouti.com

# chouti.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from ..items import Day24SpiderItem

# For Windows: re-wrap stdout so printing to the console does not fail on encoding
import sys, io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['http://chouti.com/']

    def parse(self, response):
        # print(response.body)
        # print(response.text)
        hxs = HtmlXPathSelector(response)
        item_list = hxs.xpath('//div[@id="content-list"]/div[@class="item"]')
        # Find the link, title and author of every entry on the front page,
        # then yield them to the pipeline for persistence
        for item in item_list:
            link = item.xpath('./div[@class="news-content"]/div[@class="part1"]/a/@href').extract_first()
            title = item.xpath('./div[@class="news-content"]/div[@class="part2"]/@share-title').extract_first()
            author = item.xpath('./div[@class="news-content"]/div[@class="part2"]/a[@class="user-a"]/b/text()').extract_first()
            yield Day24SpiderItem(link=link, title=title, author=author)

        # Find the links to page 2, page 3, ..., page 10 and crawl them all for persistence
        # hxs.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # Or match them precisely with a regular expression:
        page_url_list = hxs.xpath('//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for url in page_url_list:
            url = "http://dig.chouti.com" + url
            print(url)
            yield Request(url, callback=self.parse, dont_filter=False)
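To try the spider, run scrapy crawl chouti from the project root. Each yielded Day24SpiderItem is handed to the pipeline configured in settings.py below, which appends it to the file named by the STORAGE_CONFIG setting.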

  

# pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class Day24SpiderPipeline(object):

    def __init__(self, file_path):
        self.file_path = file_path  # path of the output file
        self.file_obj = None        # file object used for reads and writes

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called during initialization to create the pipeline object.
        :param crawler:
        :return:
        """
        val = crawler.settings.get('STORAGE_CONFIG')
        return cls(val)

    def process_item(self, item, spider):
        print(">>>> ", item)
        if 'chouti' == spider.name:
            self.file_obj.write(item.get('link') + "\n" + item.get('title') + "\n" + item.get('author') + "\n\n")
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        if 'chouti' == spider.name:
            # Without encoding='utf-8' the Chinese text written to the file would be garbled
            self.file_obj = open(self.file_path, mode='a+', encoding='utf-8')

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider:
        :return:
        """
        if 'chouti' == spider.name:
            self.file_obj.close()
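Note that process_item above concatenates item.get('link'), item.get('title') and item.get('author') directly, so a missing field (None) would raise a TypeError during the write. A minimal defensive variant of the same method, assuming the same three fields, could look like this:

    def process_item(self, item, spider):
        print(">>>> ", item)
        if 'chouti' == spider.name:
            # Fall back to an empty string so a missing field cannot break the write
            line = "\n".join([item.get('link') or "", item.get('title') or "", item.get('author') or ""])
            self.file_obj.write(line + "\n\n")
        return item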

  

# items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class Day24SpiderItem(scrapy.Item):
    link = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
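Day24SpiderItem behaves like a dict with a fixed set of keys, which is why the spider can build it with keyword arguments and the pipeline can read it with item.get(...). A quick sanity check (the values below are just sample data):

item = Day24SpiderItem(link='http://dig.chouti.com/link/123', title='demo title', author='someone')
print(item['title'])       # 'demo title'
print(item.get('author'))  # 'someone'
print(dict(item))          # plain dict with the three fields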

  

# settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for day24spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'day24spider'

SPIDER_MODULES = ['day24spider.spiders']
NEWSPIDER_MODULE = 'day24spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'day24spider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'day24spider.middlewares.Day24SpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'day24spider.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'day24spider.pipelines.Day24SpiderPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

STORAGE_CONFIG = "chouti.json"
DEPTH_LIMIT = 1
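The last two settings deserve a comment: STORAGE_CONFIG is not a built-in Scrapy setting but a project-specific value that Day24SpiderPipeline.from_crawler() reads from crawler.settings, while DEPTH_LIMIT = 1 is a built-in setting that stops the crawl one level below the start URLs (the paging requests yielded by parse are followed, but nothing deeper). Both can also be overridden for a single run with the -s option, e.g. scrapy crawl chouti -s DEPTH_LIMIT=2.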

  

三、Applying the classmethod

from_crawler()   -->   __init__()
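When a component defines from_crawler(), Scrapy calls that classmethod first; it reads whatever it needs from crawler.settings and then constructs the instance itself, which is what triggers __init__(). A stripped-down sketch of the pattern (the class name here is generic, not tied to this project):

class MyPipeline(object):
    def __init__(self, file_path):
        self.file_path = file_path

    @classmethod
    def from_crawler(cls, crawler):
        # Called by Scrapy before the instance exists: pull config out of
        # the settings, then build the object with it (this runs __init__).
        return cls(file_path=crawler.settings.get('STORAGE_CONFIG'))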
