13.CrawlSpider类爬虫

1.CrawlSpider介绍

Scrapy框架中分两类爬虫，Spider类和CrawlSpider类。
此案例采用的是CrawlSpider类实现爬虫。

它是Spider的派生类，Spider类的设计原则是只爬取start_url列表中的网页，而CrawlSpider类定义了一些规则(rule)来提供跟进link的方便的机制，从爬取的网页中获取link并继续爬取的工作更适合。

创建项目指令：

scrapy startproject baidu

模版创建：

scrapy genspider -t crawl baidu 'tieba.baidu.com'

CrawlSpider继承于Spider类，除了继承过来的属性外（name、allow_domains），还提供了新的属性和方法:

LinkExtractors

class scrapy.linkextractors.LinkExtractor

Link Extractors 的目的很简单: 提取链接｡
每个LinkExtractor有唯一的公共方法是 extract_links()，它接收一个 Response 对象，并返回一个 scrapy.link.Link 对象。
Link Extractors要实例化一次，并且 extract_links 方法会根据不同的 response 调用多次提取链接｡

主要参数：

            allow：满足括号中“正则表达式”的值会被提取，如果为空，则全部匹配。

            deny：与这个正则表达式(或正则表达式列表)不匹配的URL一定不提取。

            allow_domains：会被提取的链接的domains。

            deny_domains：一定不会被提取链接的domains。

            restrict_xpaths：使用xpath表达式，和allow共同作用过滤链接。

rules

在rules中包含一个或多个Rule对象，每个Rule对爬取网站的动作定义了特定操作。如果多个rule匹配了相同的链接，则根据规则在本集合中被定义的顺序，第一个会被使用。

参数介绍：
link_extractor：是一个Link Extractor对象，用于定义需要提取的链接

callback： 从link_extractor中每获取到链接时，参数所指定的值作为回调函数，该回调函数接受一个response作为其第一个参数。

    注意：当编写爬虫规则时，避免使用parse作为回调函数。由于CrawlSpider使用parse方法来实现其逻辑，如果覆盖了 parse方法，crawl spider将会运行失败。

    follow：是一个布尔(boolean)值，指定了根据该规则从response提取的链接是否需要跟进。 如果callback为None，follow 默认设置为True ，否则默认为False。

    process_links：指定该spider中哪个的函数将会被调用，从link_extractor中获取到链接列表时将会调用该函数。该方法主要用来过滤。

    process_request：指定该spider中哪个的函数将会被调用， 该规则提取到每个request时都会调用该函数。 (用来过滤request)

2.创建案例

a.开始一个项目

scrapy startproject wxapp

b.创建模板

scrapy genspider -t crawl wxapp_spider "wxapp-union.com"

c.settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project

#

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

#

#     https://doc.scrapy.org/en/latest/topics/settings.html

#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']

NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

#USER_AGENT = 'wxapp (+http://www.yourdomain.com)'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

DEFAULT_REQUEST_HEADERS = {

  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

  'Accept-Language': 'en',

    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

}

# Enable or disable spider middlewares

# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'wxapp.middlewares.WxappSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://doc.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

   'wxapp.pipelines.WxappPipeline': 300,

}

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

wxapp_spider.py

# -*- coding: utf-8 -*-

import scrapy

from scrapy.linkextractors import LinkExtractor

from scrapy.spiders import CrawlSpider, Rule

from wxapp.items import WxappItem

class WxappSpiderSpider(CrawlSpider):

    name = 'wxapp_spider'

    allowed_domains = ['wxapp-union.com']

    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (

        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),

        Rule(LinkExtractor(allow=r'.+article-.+\.html'),callback="parse_detail",follow=False)

    )

    def parse_detail(self, response):

        title=response.xpath('//h1[@class="ph"]/text()').get()

        author_p=response.xpath('//p[@class="authors"]')

        author=author_p.xpath('.//a/text()').get()

        pub_time=author_p.xpath('.//span/text()').get()

        content=response.xpath('//td[@id="article_content"]//text()').getall()

        item=WxappItem(title=title,author=author,pub_time=pub_time,content=content)

        yield item

        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()

        #i['name'] = response.xpath('//div[@id="name"]').extract()

        #i['description'] = response.xpath('//div[@id="description"]').extract()

items.py

import scrapy

class WxappItem(scrapy.Item):

    # define the fields for your item here like:

    # name = scrapy.Field()

    title=scrapy.Field()

    author=scrapy.Field()

    pub_time=scrapy.Field()

    content=scrapy.Field()

pipelines.py

from scrapy.exporters import JsonLinesItemExporter

class WxappPipeline(object):

    def __init__(self):

        self.fp=open("wxapp.json","wb")

        self.exporter=JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding="utf-8")

    def process_item(self, item, spider):

        self.exporter.export_item(item)

        return item

    def close_spider(self,spider):

        self.fp.close()

在wxapp目录下创建：start.py

from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())

执行start.py即可

13.CrawlSpider类爬虫的更多相关文章

Scrapy框架——CrawlSpider类爬虫案例
Scrapy--CrawlSpider Scrapy框架中分两类爬虫,Spider类和CrawlSpider类. 此案例采用的是CrawlSpider类实现爬虫. 它是Spider的派生类,Spide ...
Scrapy的Spider类和CrawlSpider类
Scrapy shell 用来调试Scrapy 项目代码的命令行工具,启动的时候预定义了Scrapy的一些对象设置 shell Scrapy 的shell是基于运行环境中的python 解释器sh ...
python爬虫入门（八）Scrapy框架之CrawlSpider类
CrawlSpider类通过下面的命令可以快速创建 CrawlSpider模板的代码: scrapy genspider -t crawl tencent tencent.com CrawSpid ...
爬虫笔记之刷小怪练级：yymp3爬虫（音乐类爬虫）
一.目标爬取http://www.yymp3.com网站歌曲相关信息,包括歌曲名字.作者相关信息.歌曲的音频数据.歌曲的歌词数据. 二.分析 2.1 歌曲信息.歌曲音频数据下载地址的获取随便打开一 ...
scrapy的CrawlSpider类
了解CrawlSpider 踏实爬取一般网站的常用spider,其中定义了一些规则(rule)来提供跟进link的方便机制,也许该spider不适合你的目标网站,但是对于大多数情况是可以使用的.因此, ...
scrapy项目4：爬取当当网中机器学习的数据及价格（CrawlSpider类）
scrapy项目3中已经对网页规律作出解析,这里用crawlspider类对其内容进行爬取: 项目结构与项目3中相同如下图,唯一不同的为book.py文件 crawlspider类的爬虫文件book的 ...
C++ primer plus读书笔记——第13章类继承
第13章类继承 1. 如果购买厂商的C库,除非厂商提供库函数的源代码,否则您将无法根据自己的需求,对函数进行扩展或修改.但如果是类库,只要其提供了类方法的头文件和编译后的代码,仍可以使用库中的类派生 ...
【Spider】使用CrawlSpider进行爬虫时，无法爬取数据，运行后很快结束，但没有报错
在学习<python爬虫开发与项目实践>的时候有一个关于CrawlSpider的例子,当我在运行时发现,没有爬取到任何数据,以下是我敲的源代码:import scrapyfrom UseS ...
scrapy框架用CrawlSpider类爬取电影天堂.
本文使用CrawlSpider方法爬取电影天堂网站内国内电影分类下的所有电影的名称和下载地址 CrawlSpider其实就是Spider的一个子类. CrawlSpider功能更加强大(链接提取器,规 ...

随机推荐

HDU1542-Atlantis【离散化&线段树&扫描线】个人认为很全面的详解
刚上大一的时候见过这种题,感觉好牛逼哇,这都能算如今已经不打了,不过适当写写题保持思维活跃度还是不错的,又碰到这种题了,想把它弄出来说实话,智商不够,看了很多解析,花了4.5个小时才弄明白网上好 ...
TCP/UDP区别
一:1. 大体上来说,TCP和UDP都是通过Internet发送数据包的协议.都建立在Internet协议上.就是无论你是用TCP协议还是用UDP协议发送数据包,都会被发送到IP地址: 2.数据包的处 ...
VCC、VDD和VSS
在电子电路中,常可以看到VCC.VDD和VSS三种不同的符号,它们有什么区别呢? 一.解释 VCC:C=circuit 表示电路的意思, 即接入电路的电压: VDD:D=device 表示器件的意思 ...
CAN协议，系统结构和帧结构
CAN:Controller Area Network,控制器局域网是一种能有效支持分布式控制和实时控制的串行通讯网络. CAN-bus: Controller Area Network-bus,控 ...
Python3 与 C# 并发编程之～线程篇
2.线程篇¶ 在线预览:https://github.lesschina.com/python/base/concurrency/3.并发编程-线程篇.html 示例代码:https://gith ...
【POJ2182】Lost Cows 树状数组+二分
题中给出了第 i 头牛前面有多少比它矮,如果正着分析比较难找到规律.因此,采用倒着分析的方法(最后一头牛的rank可以直接得出),对于第 i 头牛来说,它的rank值为没有被占用的rank集合中的第A ...
RGBColorspace 与 GRAYColorspace 图片混合后，生成的视频有点问题
最近有一个用户遇到一个情况: 有3张图片,其中前两张是 RGBColorspace,最后一张是 GrayColorspace: 生成的视频,在显示最后一张图片的时候,明显出现奇怪的色彩区域,看下图: ...
MySQL表结构的优化和设计
仅供自己学习结论写在前面: 1.给字段选取最合适的数据类型 2.数据类型的宽度尽可能的小 3.给where条件的字段设置索引 4.允许部分数据冗余 5.字段要尽可能的设置为not null,特别 ...
struts2 contextMap
一.contextMap中的数据操作 root根:List 元素1 元素2 元素3 元素4 元素5 contextMap:Map key value application Map key value ...
unsigned 变量名:n
在结构体内定义位,节省空间 /* * size是字节数 * addr是打印的起始地址 */ static void printb(void * addr,size_t size){ ;i<siz ...

13.CrawlSpider类爬虫

rules

13.CrawlSpider类爬虫的更多相关文章

随机推荐

热门专题