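14 - Scrapy Framework (CrawlSpider)
Introduction to CrawlSpider
CrawlSpider is a subclass of Spider, so it inherits Spider's methods and adds its own, which makes whole-site crawls more efficient and concise to write. Its most notable feature is the LinkExtractor (link extractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls, whereas CrawlSpider is better suited to extracting URLs from the pages it fetches and continuing the crawl from them.
Using CrawlSpider
1. Create a Scrapy project: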
scrapy startproject projectName
2. Create the spider file:
scrapy genspider -t crawl SpiderName www.xxx.com
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
The code above is the generated spider file.
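LinkExtractor (link extractor): extracts links from a page according to a specified rule (a regular expression).
Rule (rule parser): sends requests for the links that the link extractor found, then parses the returned pages with the specified callback.
Each link extractor is paired with exactly one Rule.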
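To see what an allow pattern selects, the sketch below mimics the filtering step with Python's re module instead of Scrapy itself. The hrefs are hypothetical, select_links is an illustrative helper rather than a Scrapy API, and the sketch assumes the pattern is applied as a regex search against each absolute URL.

```python
import re
from urllib.parse import urljoin

# Hypothetical hrefs as they might appear on a listing page.
base_url = "https://www.4567tv.tv/frim/index1.html"
hrefs = [
    "/frim/index1-2.html",
    "/frim/index1-3.html",
    "/movie/index12345.html",
    "/about.html",
]


def select_links(base, hrefs, allow):
    """Return the absolute URLs that the `allow` regex matches (search, not full match)."""
    pattern = re.compile(allow)
    absolute = [urljoin(base, href) for href in hrefs]
    return [url for url in absolute if pattern.search(url)]


page_links = select_links(base_url, hrefs, r"/frim/index1-\d+\.html")
detail_links = select_links(base_url, hrefs, r"/movie/index\d+\.html")
# Without the backslash, "d+" means one or more literal letters "d",
# so no numeric detail URL is selected.
broken_links = select_links(base_url, hrefs, r"/movie/indexd+\.html")

print(page_links)
print(detail_links)
print(broken_links)
```

The last call shows why the backslash in \d+ matters: a pattern that loses it compiles without error but silently matches nothing.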
Crawl the movie titles and actor names from the whole 4567tv.tv site and persist them:
spiders/4567tv.py:
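# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawlProject.items import CrawlprojectItem


# Example pagination link: "/frim/index1-2.html"
class A4567tvSpider(CrawlSpider):
    name = '4567Tv'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    # Link extractors: `allow` is a regular expression matched against each URL.
    # If `allow` is empty, every link matches.
    link = LinkExtractor(allow=r'/frim/index1-\d+\.html')
    link1 = LinkExtractor(allow=r'/movie/index\d+\.html')

    # rules: a tuple of Rule objects; each Rule describes one extraction rule.
    # A Rule requests the links found by its link extractor and passes each
    # response to the named callback.
    rules = (
        # follow=True: keep applying the rules to the pages these links lead to,
        # so every pagination page is eventually reached.
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_detail'),
    )

    def parse_item(self, response):
        first_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in first_list:
            title = li.xpath('./div/a/@title').extract_first()
            name = li.xpath('./div/div/p/text()').extract_first()
            item = CrawlprojectItem()
            item["title"] = title
            item["name"] = name
            yield item

    def parse_detail(self, response):
        # Placeholder (assumption): the original post references this callback
        # but does not show it. Add the detail-page parsing here.
        pass


# CrawlSpider crawl flow:
# a. The spider requests the start URL and obtains that page's content.
# b. The link extractor extracts links from the page in step a according to its rule.
# c. The rule parser requests those links and parses the returned pages with
#    the specified callback.
# d. The parsed data is wrapped in an item and submitted to the pipeline for
#    persistent storage.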
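The crawl flow described above can be sketched without Scrapy or a network connection. In this simplified model, SITE is a hypothetical in-memory site and crawl is an illustrative function, not Scrapy's scheduler. It assumes two behaviors of CrawlSpider: duplicate URLs are requested only once, and the start page is used only for link extraction (Scrapy hands it to parse_start_url rather than to the rule's callback).

```python
import re
from collections import deque

# Hypothetical site: URL -> (links found on the page, movie titles on the page).
SITE = {
    "/frim/index1.html": (["/frim/index1-2.html", "/about.html"], ["Movie A"]),
    "/frim/index1-2.html": (["/frim/index1.html", "/frim/index1-3.html"], ["Movie B"]),
    "/frim/index1-3.html": (["/frim/index1-2.html"], ["Movie C"]),
    "/about.html": ([], []),
}


def crawl(start_url, allow, follow=True):
    """Simulate one Rule(LinkExtractor(allow=...), callback=..., follow=...)."""
    pattern = re.compile(allow)
    seen = {start_url}  # duplicate filter: each URL is requested once
    # Queue entries: (url, run_callback, follow_links)
    queue = deque([(start_url, False, True)])
    items = []
    while queue:
        url, run_callback, follow_links = queue.popleft()
        links, titles = SITE[url]  # step a: "download" the page
        if run_callback:  # step c: parse pages reached through the rule
            items.extend({"title": title} for title in titles)  # step d
        if follow_links:  # step b: extract the links that match the rule
            for link in links:
                if pattern.search(link) and link not in seen:
                    seen.add(link)
                    queue.append((link, True, follow))
    return items


followed = crawl("/frim/index1.html", r"/frim/index1-\d+\.html", follow=True)
not_followed = crawl("/frim/index1.html", r"/frim/index1-\d+\.html", follow=False)
print(followed)
print(not_followed)
```

With follow=True the crawl reaches page 3 through page 2. With follow=False it stops at the links found on the start page. In both runs "Movie A" is absent because the start page itself never goes through the rule's callback.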
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlprojectItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    name = scrapy.Field()
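pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class CrawlprojectPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        print("Spider started!")
        self.fp = open("./movies.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item. extract_first() returns None when nothing
        # matches, so missing fields are written as empty strings.
        title = item.get("title") or ""
        name = item.get("name") or ""
        self.fp.write(title + ":" + name + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file.
        print("Spider finished!")
        self.fp.close()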
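The pipeline's three hooks run in a fixed order: open_spider once, process_item for each item, close_spider once. The sketch below drives a stand-in class by hand to show that lifecycle without running a crawl. MoviePipeline, the temporary file path, and the sample items are all hypothetical, and plain dicts stand in for CrawlprojectItem.

```python
import os
import tempfile


class MoviePipeline:
    """Stand-in with the same three hooks as CrawlprojectPipeline."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Runs once, before any item arrives.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs once per item; a missing field becomes an empty string.
        title = item.get("title") or ""
        name = item.get("name") or ""
        self.fp.write(f"{title}:{name}\n")
        return item  # returning the item passes it to any later pipeline

    def close_spider(self, spider):
        # Runs once, after the last item.
        self.fp.close()


path = os.path.join(tempfile.mkdtemp(), "movies.txt")
pipeline = MoviePipeline(path)

pipeline.open_spider(spider=None)
for item in [{"title": "Movie A", "name": "Actor A"}, {"title": "Movie B", "name": None}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fh:
    written = fh.read()
print(written)
```

Opening the file in open_spider rather than in process_item keeps a single file handle for the whole crawl instead of reopening the file for every item.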
settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for crawlProject project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'crawlProject'

SPIDER_MODULES = ['crawlProject.spiders']
NEWSPIDER_MODULE = 'crawlProject.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawlProject (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "ERROR"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'crawlProject.middlewares.CrawlprojectDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'crawlProject.pipelines.CrawlprojectPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'