CrawlSpider介绍

  CrawlSpider是Spider的一个子类,意味着拥有Spider的方法,以及自己的方法,更加高效简洁。其中最显著的功能就是"LinkExtractors"链接提取器。Spider是所有爬虫的基类,其设计只是为了爬取start_urls列表中的网页。然而CrawlSpider更适合在网页中提取url继续进行爬取。

CrawlSpider使用

  1、创建scrapy工程:

scrapy startproject projectName

  2、创建爬虫文件:

scrapy genspider -t crawl SpiderName www.xxx.com
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule class A4567tvSpider(CrawlSpider):
name = '4567Tv'
# allowed_domains = ['www.xxx.com']
start_urls = ['http://www.xxx.com/'] rules = (
Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
) def parse_item(self, response):
item = {}
#item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
#item['name'] = response.xpath('//div[@id="name"]').get()
#item['description'] = response.xpath('//div[@id="description"]').get()
return item

创建的爬虫文件代码

LinkExtractor连接提取器:根据指定规则(正则)进行连接的提取

  Rule规则解析器:将链接提取器提取到的链接进行请求发送,然后对获取的页面数据进行
  指定规则(callback)的解析
   一个链接提取器对应唯一一个规则解析器

爬取4567tv.tv的全栈电影名字以及演员名字进行持久化储存:

spider/4567tv.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlProject.items import CrawlprojectItem
#"/frim/index1-2.html"
class A4567tvSpider(CrawlSpider):
name = '4567Tv'
# allowed_domains = ['www.xxx.com']
start_urls = ['https://www.4567tv.tv/frim/index1.html']
link = LinkExtractor(allow=r'/frim/index1-\d+\.html')#链接采集器 正则表达式
#如果正则为空,则匹配所有的链接
link1 = LinkExtractor(allow=r'/movie/indexd+\.html')
rules = (
Rule(link, callback='parse_item', follow=True),#参数三True就是采集所有的网页
Rule(link1, callback='parse_detail'),
)
#rules=():指定不同规则解析器。一个Rule对象表示一种提取规则
#Rule:规则解析器。根据链接提取器中提取到的链接,根据指定规则提取解析器链接网页的内容
def parse_item(self, response):
first_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
for url in first_list:
title = url.xpath('./div/a/@title').extract_first()
name = url.xpath('./div/div/p/text()').extract_first()
item = CrawlprojectItem()
item["title"] = title
item["name"] = name
yield item #CrawlSpider的爬取流程:
"""爬虫文件首先根据起始的url、获取该url的网页内容。
链接提取器会根据指定提取规则将步骤a中网页内容中的链接进行提取
规则解析器会根据指定解析规则将链接提取器中的网页中的内容根据指定的规则进行解析
将解析数据封装到item中。提交给管道进行持久化储存
"""

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html import scrapy class CrawlprojectItem(scrapy.Item):
# define the fields for your item here like:
title = scrapy.Field()
name = scrapy.Field()

pipelins.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html class CrawlprojectPipeline(object):
def __init__(self):
self.fp = None
def open_spider(self,spider):
print("开始爬虫!!!")
self.fp = open("./movies.txt","w",encoding="utf-8")
def process_item(self, item, spider):
self.fp.write(item["title"]+":"+item["name"]+"\n")
return item
def close_spider(self,spider):
print("爬虫结束!!!")
self.fp.close()
# -*- coding: utf-8 -*-

# Scrapy settings for crawlProject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html BOT_NAME = 'crawlProject' SPIDER_MODULES = ['crawlProject.spiders']
NEWSPIDER_MODULE = 'crawlProject.spiders' # Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawlProject (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32 # Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16 # Disable cookies (enabled by default)
#COOKIES_ENABLED = False # Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False # Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#} # Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'crawlProject.middlewares.CrawlprojectSpiderMiddleware': 543,
#} # Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'crawlProject.middlewares.CrawlprojectDownloaderMiddleware': 543,
#} # Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#} # Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'crawlProject.pipelines.CrawlprojectPipeline': 300,
} # Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

settings.py

14-scrapy框架(CrawlSpider)的更多相关文章

  1. 全栈爬取-Scrapy框架(CrawlSpider)

    引入 提问:如果想要通过爬虫程序去爬取”糗百“全站数据新闻数据的话,有几种实现方法? 方法一:基于Scrapy框架中的Spider的递归爬取进行实现(Request模块递归回调parse方法). 方法 ...

  2. Scrapy框架——CrawlSpider类爬虫案例

    Scrapy--CrawlSpider Scrapy框架中分两类爬虫,Spider类和CrawlSpider类. 此案例采用的是CrawlSpider类实现爬虫. 它是Spider的派生类,Spide ...

  3. Scrapy框架——CrawlSpider爬取某招聘信息网站

    CrawlSpider Scrapy框架中分两类爬虫,Spider类和CrawlSpider类. 它是Spider的派生类,Spider类的设计原则是只爬取start_url列表中的网页, 而Craw ...

  4. python爬虫之Scrapy框架(CrawlSpider)

    提问:如果想要通过爬虫程序去爬取”糗百“全站数据新闻数据的话,有几种实现方法? 方法一:基于Scrapy框架中的Spider的递归爬去进行实现的(Request模块回调) 方法二:基于CrawlSpi ...

  5. 爬虫开发14.scrapy框架之分布式操作

    分布式爬虫 一.redis简单回顾 1.启动redis: mac/linux:   redis-server redis.conf windows: redis-server.exe redis-wi ...

  6. 网络爬虫之scrapy框架(CrawlSpider)

    一.简介 CrawlSpider其实是Spider的一个子类,除了继承到Spider的特性和功能之外,还派生了其自己独有的更强大的特性和功能.其中最显著的功能就是"LinkExtractor ...

  7. Scrapy框架-CrawlSpider

    目录 1.CrawlSpider介绍 2.CrawlSpider源代码 3. LinkExtractors:提取Response中的链接 4. Rules 5.重写Tencent爬虫 6. Spide ...

  8. Scrapy 框架 CrawlSpider 全站数据爬取

    CrawlSpider 全站数据爬取 创建 crawlSpider 爬虫文件 scrapy genspider -t crawl chouti www.xxx.com import scrapy fr ...

  9. 爬虫Scrapy框架-Crawlspider链接提取器与规则解析器

    Crawlspider 一:Crawlspider简介 CrawlSpider其实是Spider的一个子类,除了继承到Spider的特性和功能外,还派生除了其自己独有的更加强大的特性和功能.其中最显著 ...

  10. 16.Python网络爬虫之Scrapy框架(CrawlSpider)

    引入 提问:如果想要通过爬虫程序去爬取”糗百“全站数据新闻数据的话,有几种实现方法? 方法一:基于Scrapy框架中的Spider的递归爬取进行实现(Request模块递归回调parse方法). 方法 ...

随机推荐

  1. Tornado—三种启动tornado的方式

    第一种启动方式:单进程 import tornado.web # web服务 import tornado.ioloop # I/O 时间循环 class Mainhandler(tornado.we ...

  2. linux 常用命令及软件

    命令基于ubuntu 18.04 修改网卡配置 /etc/netplan/50-cloud-init.yaml #修改 netplan apply #应用修改 修改计算机名 sudo hostname ...

  3. Python动态网页爬虫-----动态网页真实地址破解原理

    参考链接:Python动态网页爬虫-----动态网页真实地址破解原理

  4. IT兄弟连 HTML5教程 HTML5表单 HTML表单中的get和post方法

    指引 表单在网页应用中十分重要,基本上任何一个网站都必须使用到表单元素,所以表单的美观和易于交互对于网站设计就变得十分重要.HTML5对目前Web表单进行了全面提升,使得我们使用表单更加智能.它在保持 ...

  5. 什么是单点登录,php是如何实现单点登录的

    单点登录SSO(Single Sign On)说得简单点就是在一个多系统共存的环境下,用户在一处登录后,就不用在其他系统中登录,也就是用户的一次登录能得到其他所有系统的信任.单点登录在大型网站里使用得 ...

  6. vue定义data的三种方式与区别

    在vue中,定义data可以有三种写法. 1.第一种写法,对象. var app = new Vue({ el: '#yanggb', data: { yanggb: 'yanggb' } }) 2. ...

  7. Linux配置部署_新手向(三)——MySql安装与配置

    目录 前言 安装 防火墙 小结 前言 马上就要放假了,按捺不住激动的心情(其实是实在敲不下去代码),就继续鼓捣虚拟机来做些常规的安装与使用吧,毕竟闲着也是闲着,唉,opengl还是难啊. 安装 其实网 ...

  8. ASP.NET Core 2.2 WebApi 系列【四】集成Swagger

    Swagger 是一款自动生成在线接口文档+功能测试功能软件 一.安装程序包 通过管理 NuGet 程序包安装,搜索Swashbuckle.AspNetCore 二.配置 Swagger 将 Swag ...

  9. layui省市区三级联动城市选择

    基于layui框架制作精美的省市区下拉框三级联动菜单选择, 支持三级联动城市选择,点击提交获取选中值代码. 示例图如下: 资源链接: https://pan.baidu.com/s/1s6l8iDBE ...

  10. 基于C# 调用百度AI 人脸识别

    一.设置 登录百度云控制台,添加应用-添加人脸识别,查找,对比等. 记住API Key和Secret Key 二.创建Demo程序 1.使用Nuget安装 Baidu.AI 和 Newtonsoft. ...