Crawler 06 / The Scrapy Framework

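1. Scrapy overview / installation

Scrapy is an asynchronous crawling framework.
- Features: high-performance data parsing, persistent storage, full-site data crawling, middleware, and distributed crawling
- Twisted: the library that provides Scrapy's asynchronous mechanism, most visible in the downloader
A framework is a project template that bundles a wide range of functionality and is general enough to reuse across projects.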

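Environment setup:

Linux:

pip3 install scrapy

Windows:

a. pip3 install wheel

b. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

c. Change to the download directory and run pip3 install Twisted-18.9.0-cp36-cp36m-win_amd64.whl (pick the wheel that matches your Python version and architecture)

d. pip3 install pywin32

e. pip3 install scrapy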

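2. Basic usage

1. Creating a project

Create a new project: scrapy startproject proName
- settings.py: the configuration file of the current project
- spiders: the spider package, which must hold one or more spider files (.py)
Change into the project directory: cd proName
Create a spider file: scrapy genspider spiderName www.lbzhk.com

Run the project: scrapy crawl spiderName (the spider's name, i.e. the value of its name attribute)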

settings.py (after creating a project, it is common to make the following changes first):

Do not obey the robots.txt protocol
Spoof the User-Agent (UA)
Set the log level: LOG_LEVEL = 'ERROR'

# Set the User-Agent request header
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'

# Whether to obey robots.txt
ROBOTSTXT_OBEY = False

# Log level to record
LOG_LEVEL = 'ERROR'

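2. Data parsing

response.xpath('XPath expression')
In Scrapy, xpath() does not return a list of strings. It returns a list of Selector objects, and the matched text is stored in each Selector's data attribute. Call extract() or extract_first() to get the string data out.

extract(): applies to every element of the list and returns a list of strings

extract_first(): applies only to the first element of the list and returns a single string (or None if the list is empty)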

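3. Persistent storage

Persistent storage via a terminal command

Return the parsed data from the parse method
Run the command: scrapy crawl spiderName -o ./duanzi.csv

Notes:

This approach cannot write to a database. It stores only the return value of parse, and only in files with a supported extension such as json, jsonlines, jl, csv, xml, marshal, or pickle.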

Code example: create the spider file in the spiders folder inside the project folder

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    # Spider name: the unique identifier of this spider
    name = 'first'
    # Allowed domains
    # allowed_domains = ['www.baidu.com']
    # Start URL list: every element must be a URL.
    # Scrapy sends a request to each URL in this list.
    start_urls = ['http://duanziwang.com/category/%E7%BB%8F%E5%85%B8%E6%AE%B5%E5%AD%90/']

    # Data parsing
    # parse is called once per response, so the call count depends on the number of requests
    # def parse(self, response):
    #     article_list = response.xpath('/html/body/section/div/div/main/article')
    #     for article in article_list:
    #         # xpath returns Selector objects rather than strings; the text is in each object's data attribute
    #         # title = article.xpath('./div[1]/h1/a/text()')[0].extract()
    #         title = article.xpath('./div[1]/h1/a/text()').extract_first()
    #         content = article.xpath('./div[2]//text()').extract()
    #         content = ''.join(content)
    #         print(title,content)

    # Persistent storage via a terminal command
    def parse(self, response):
        all_data = []
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for article in article_list:
            # xpath returns Selector objects rather than strings; the text is in each object's data attribute
            # title = article.xpath('./div[1]/h1/a/text()')[0].extract()
            title = article.xpath('./div[1]/h1/a/text()').extract_first()
            content = article.xpath('./div[2]//text()').extract()
            content = ''.join(content)
            dic = {
                'title':title,
                'content':content
            }
            all_data.append(dic)
        return all_data   # return the parsed data so that -o can export it
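
To make the role of the return value concrete, here is a rough standard-library sketch of what exporting a list of dictionaries to CSV amounts to. It is not Scrapy's feed exporter; the rows_to_csv helper and the sample rows are hypothetical and only stand in for the list that parse returns.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts (like the return value of parse) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()       # first line: the column names
    writer.writerows(rows)     # one line per dictionary
    return buffer.getvalue()


# Hypothetical sample data standing in for scraped articles
all_data = [
    {'title': 'first joke', 'content': 'text of the first joke'},
    {'title': 'second joke', 'content': 'text of the second joke'},
]
csv_text = rows_to_csv(all_data, fieldnames=['title', 'content'])
print(csv_text)
```

Each dictionary becomes one row, which is why parse has to return plain key/value data for this storage method to work.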

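Persistent storage via pipelines

Parse the data in the spider file
Wrap the parsed data in an Item object
Submit the Item object to the pipeline
The pipeline receives the item in its process_item method and persists it in some form
Enable the pipeline in the settings file

Notes:

One pipeline class corresponds to one form of persistent storage. Storing to several databases or files requires several pipeline classes.
return item in process_item passes the item on to the next pipeline class to be executed
If writing a dict directly to Redis raises an error (newer versions of the redis package do not support it), one workaround is: pip install redis==2.10.6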
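
The hand-off between pipeline classes can be illustrated with a small toy model. This is a sketch under stated assumptions, not Scrapy's implementation: run_pipelines, UpperCaseTitlePipeline, and CollectPipeline are hypothetical names, and the only behavior borrowed from Scrapy is that pipelines run in ascending order of their priority number and that each one receives what the previous one returned.

```python
class UpperCaseTitlePipeline:
    """First stage: normalizes the title, then passes the item on."""

    def process_item(self, item, spider):
        item['title'] = item['title'].upper()
        return item  # returning the item hands it to the next pipeline


class CollectPipeline:
    """Second stage: stores whatever the previous stage returned."""

    def __init__(self):
        self.stored = []

    def process_item(self, item, spider):
        self.stored.append(dict(item))
        return item


def run_pipelines(item, pipelines, spider=None):
    """Call each pipeline in ascending priority order (lower number first)."""
    for pipeline in sorted(pipelines, key=pipelines.get):
        item = pipeline.process_item(item, spider)
    return item


collector = CollectPipeline()
# Same shape as ITEM_PIPELINES: pipeline -> priority number
pipelines = {collector: 301, UpperCaseTitlePipeline(): 300}
run_pipelines({'title': 'joke', 'content': 'text'}, pipelines)
print(collector.stored)
```

If a process_item method forgot to return the item, the next stage in this model would receive None, which is the reason for the return item rule above.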

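Code example:

Settings file: settings.py

ITEM_PIPELINES = {
   # The number is the priority: the lower the value, the earlier the pipeline runs
   'DuanziPro.pipelines.DuanziproPipeline': 300,
}

Define an Item class: items.py

import scrapy


class DuanziproItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()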

Spider file: duanzi.py

import scrapy
from DuanziPro.items import DuanziproItem


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://duanziwang.com/category/%E7%BB%8F%E5%85%B8%E6%AE%B5%E5%AD%90/']

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for article in article_list:
            title = article.xpath('./div[1]/h1/a/text()').extract_first()
            content = article.xpath('./div[2]//text()').extract()
            content = ''.join(content)
            # Instantiate an Item object and store the parsed data in it
            item = DuanziproItem()
            item['title'] = title
            item['content'] = content
            yield item  # submit the item to the pipeline

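Pipeline that handles persistent storage: pipelines.py

class DuanziproPipeline(object):
    fp = None

    # Called once, when the spider starts
    def open_spider(self,spider):
        print('Spider started......')
        self.fp = open('./duanzi.txt','w',encoding='utf-8')

    # Called once per item; the item parameter is the Item object received
    def process_item(self, item, spider):
        # print(item)  # item behaves like a dict
        self.fp.write(item['title']+':'+item['content']+'\n')
        return item  # pass the item on to the next pipeline class to be executed

    # Called once, when the spider finishes
    def close_spider(self,spider):
        self.fp.close()
        print('Spider finished!')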

Several pipeline classes, each storing the data in a different form

import json

import pymysql
from redis import Redis


# Write the data to a text file
class DuanziproPipeline(object):
    fp = None

    def open_spider(self,spider):
        print('Spider started......')
        self.fp = open('./duanzi.txt','w',encoding='utf-8')

    # Called once per item; the item parameter is the Item object received
    def process_item(self, item, spider):
        # print(item)  # item behaves like a dict
        self.fp.write(item['title']+':'+item['content']+'\n')
        return item  # pass the item on to the next pipeline class to be executed

    def close_spider(self,spider):
        self.fp.close()
        print('Spider finished!')


# Write the data to MySQL
class MysqlPipeLine(object):
    conn = None
    cursor = None

    def open_spider(self,spider):
        self.conn = pymysql.Connect(host='127.0.0.1',port=3306,user='root',password='222',db='spider',charset='utf8')
        # Create the cursor once, so close_spider can close it even if no item arrived
        self.cursor = self.conn.cursor()
        print(self.conn)

    def process_item(self,item,spider):
        # Let the driver fill in the placeholders instead of formatting the SQL string by hand
        sql = 'insert into duanzi values (%s,%s)'
        try:
            self.cursor.execute(sql,(item['title'],item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()


# Write the data to Redis
class RedisPipeLine(object):
    conn = None

    def open_spider(self,spider):
        self.conn = Redis(host='127.0.0.1',port=6379)
        print(self.conn)

    def process_item(self,item,spider):
        # Serialize to JSON first: newer versions of the redis package reject dict-like values
        self.conn.lpush('duanziData',json.dumps(dict(item),ensure_ascii=False))
        return item
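
Since version 3.0 the redis package accepts only bytes, strings, and numbers as values, so a dict-like item has to be serialized before it is pushed onto a list. A minimal sketch of one way to do that with the standard json module follows; serialize_item and deserialize_item are hypothetical helper names, and a plain dict stands in for the scraped item.

```python
import json


def serialize_item(item):
    """Convert a dict-like item to a JSON string suitable for a Redis list."""
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
    return json.dumps(dict(item), ensure_ascii=False)


def deserialize_item(raw):
    """Restore the dict from a stored value (Redis returns bytes by default)."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')
    return json.loads(raw)


# A plain dict stands in for a scraped item
item = {'title': '标题', 'content': '内容'}
payload = serialize_item(item)
restored = deserialize_item(payload.encode('utf-8'))
print(payload)
```

Inside a pipeline, the serialized string would be the value passed to lpush, for example conn.lpush('duanziData', serialize_item(item)).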

3. Full-site data crawling

Sending requests manually

yield scrapy.Request(url=new_url,callback=self.parse)

# url: the URL to request

# callback: the callback function that parses the response

Code example:

# Full-site crawling

import scrapy
from DuanziPro.items import DuanziproItem


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    start_urls = ['http://duanziwang.com/category/经典段子/']
    # Generic URL template for the following pages
    url = 'http://duanziwang.com/category/经典段子/%d/'
    pageNum = 1

    def parse(self, response):
        article_list = response.xpath('/html/body/section/div/div/main/article')
        for article in article_list:
            title = article.xpath('./div[1]/h1/a/text()').extract_first()
            content = article.xpath('./div[2]//text()').extract()
            content = ''.join(content)
            # Instantiate an Item object and store the parsed data in it
            item = DuanziproItem()
            item['title'] = title
            item['content'] = content
            yield item  # submit the item to the pipeline

        # Send the request for the next page manually
        if self.pageNum < 5:
            self.pageNum += 1
            print('Downloading page:',self.pageNum)
            new_url = self.url % self.pageNum
            yield scrapy.Request(url=new_url,callback=self.parse)
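
The page URLs can also be built outside the spider. The sketch below uses only the standard library; page_urls is a hypothetical helper, and the page range 2 to 5 mirrors the pageNum limit in the spider above. It percent-encodes the category name, which gives the same encoded form as the start URL used in the earlier examples. Note that an already-encoded URL contains % characters, so it should not be combined with %-style formatting; an f-string avoids that problem.

```python
from urllib.parse import quote

BASE = 'http://duanziwang.com/category/'
CATEGORY = '经典段子'


def page_urls(last_page):
    """Return the URLs of pages 2..last_page with a percent-encoded path."""
    encoded = quote(CATEGORY)  # UTF-8 bytes of each character as %XX
    # f-string instead of '%d' formatting: the encoded path itself contains '%'
    return [f'{BASE}{encoded}/{page}/' for page in range(2, last_page + 1)]


urls = page_urls(5)
for url in urls:
    print(url)
```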

Summary: when to use yield

When submitting an item to the pipeline
When sending a request manually

Sending POST requests

yield scrapy.FormRequest(url=new_url,callback=self.parse,formdata={})

# formdata: the parameters of the POST request

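Why the URLs in start_urls are requested automatically with GET: the source implementation

# Roughly what the parent class's start_requests does:

import scrapy


class DuanziSpider(scrapy.Spider):
    name = 'duanzi'
    start_urls = ['http://duanziwang.com/category/经典段子/']
    # Generic URL template for the following pages
    url = 'http://duanziwang.com/category/经典段子/%d/'
    pageNum = 1

    # Each start URL becomes a GET request whose response goes to parse
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url,callback=self.parse)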

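4. The five core components / objects

What the five core components do:
1. Engine:
  Handles the data flow of the whole system and triggers events (the core of the framework)
2. Scheduler:
  Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and filters out duplicate URLs
3. Downloader:
  Downloads page content and hands it back, via the engine, to the spiders (the downloader is built on Twisted's efficient asynchronous model)
4. Spiders:
  Do the main work: they extract the needed information, the so-called items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page
5. Item pipeline:
  Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and dropping unwanted information. After a spider has parsed a page, the items are sent to the item pipeline and processed by its stages in a fixed order.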
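
The interplay of the five components can be sketched as a toy, synchronous model. This is deliberately simplified and is not how Scrapy is implemented: the real engine is asynchronous and the real scheduler is a priority queue. FAKE_SITE, the class names, and the functions below are all hypothetical.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page)
FAKE_SITE = {
    'page/1': ('joke one', ['page/2']),
    'page/2': ('joke two', ['page/1', 'page/3']),
    'page/3': ('joke three', []),
}


class Scheduler:
    """Queues requests and drops URLs that were already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Downloader: fetches the page for a URL (here from FAKE_SITE)."""
    return FAKE_SITE[url]


def spider_parse(url, response):
    """Spider: yields an item for the page, then the links to follow."""
    text, links = response
    yield {'url': url, 'text': text}
    for link in links:
        yield link  # a plain string stands in for a new request


class Pipeline:
    """Item pipeline: stores every item it receives."""

    def __init__(self):
        self.items = []

    def process_item(self, item):
        self.items.append(item)
        return item


def engine(start_urls):
    """Engine: moves data between the other four components."""
    scheduler, pipeline = Scheduler(), Pipeline()
    for url in start_urls:
        scheduler.enqueue(url)
    while True:
        url = scheduler.next_request()
        if url is None:
            break
        for result in spider_parse(url, download(url)):
            if isinstance(result, dict):
                pipeline.process_item(result)   # items go to the pipeline
            else:
                scheduler.enqueue(result)       # requests go back to the scheduler
    return pipeline.items


items = engine(['page/1'])
print([item['url'] for item in items])
```

The link from page/2 back to page/1 is dropped by the scheduler's seen set, which is the duplicate filtering described in point 2.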

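5. Moderately improving Scrapy's crawling efficiency

Increase concurrency:

By default Scrapy handles 16 concurrent requests (these are requests, not threads), and the number can be raised moderately. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
Lower the log level:

Scrapy prints a large amount of log output while running. To reduce CPU usage, restrict the output to INFO or ERROR. In the settings file: LOG_LEVEL = 'ERROR'
Disable cookies:

If cookies are not really needed, disable them to reduce CPU usage and speed up crawling. In the settings file: COOKIES_ENABLED = False
Disable retries:

Re-requesting failed HTTP requests (retrying) slows crawling down, so retries can be disabled. In the settings file: RETRY_ENABLED = False
Reduce the download timeout:

When crawling a very slow link, a shorter download timeout lets stuck requests be abandoned quickly, which improves efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 (a timeout of 10 s)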
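
Put together, the five options form a short settings.py fragment. The values are the ones quoted in this section; whether they suit a given site is an assumption to check case by case, since disabling retries and shortening the timeout trade completeness for speed.

```python
# settings.py (fragment): the five tuning options from this section
CONCURRENT_REQUESTS = 100   # concurrent requests; Scrapy's default is 16
LOG_LEVEL = 'ERROR'         # print errors only, which reduces log output
COOKIES_ENABLED = False     # skip cookie handling when it is not needed
RETRY_ENABLED = False       # do not re-request failed pages
DOWNLOAD_TIMEOUT = 10       # seconds before a slow download is abandoned
```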

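6. Passing data between requests (meta)

Purpose: lets Scrapy do deep crawling

Deep crawling: the data to crawl is not all on one page (for example, when crawling images you first scrape the image links and then download the images through those links)
Requirement: crawl the title and synopsis from https://www.4567tv.tv/frim/index1.html

Workflow:

Passing the data:

yield scrapy.Request(url,callback=callback,meta=meta)  # the meta dict is passed along to callback

Receiving the data:

response.meta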
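
The mechanism can be modelled without Scrapy. In the toy sketch below, Request and Response are hypothetical stand-ins written for illustration, not Scrapy's classes; the one behavior they imitate is that a response exposes the meta dict of the request that produced it, so a half-filled item can travel from the list-page callback to the detail-page callback.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    """Toy request: carries a callback and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    """Toy response: remembers the request that produced it."""
    url: str
    body: str
    request: Request

    @property
    def meta(self):
        # The response exposes the meta dict of its originating request
        return self.request.meta


def parse_detail(response):
    item = response.meta['item']   # receive the item passed from the list page
    item['desc'] = response.body   # add the field found on the detail page
    return item


# List page: the title is known, the description lives on the detail page
item = {'title': 'Some movie'}
request = Request('https://example.com/detail/1', parse_detail, meta={'item': item})

# A real downloader would fetch request.url; here the body is hard-coded
response = Response(request.url, 'A short synopsis', request)
result = request.callback(response)
print(result)
```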

Code example:

items.py

import scrapy


class MovieproItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()

Spider file / deep crawling: movie.py

import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
    url = 'https://www.4567tv.tv/index.php/vod/show/class/动作/id/1/page/%d.html'

    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('.//div[@class="stui-vodlist__detail"]/h4/a/text()').extract_first()
            detail_url = 'https://www.4567tv.tv'+li.xpath('.//div[@class="stui-vodlist__detail"]/h4/a/@href').extract_first()
            item = MovieproItem()
            item['title'] = title
            # print(title,detail_url)
            # Send a manual request to the detail page URL.
            # Request passing: meta is a dict that is handed to the callback.
            yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})

    # A second, custom parse method (it must take a response parameter)
    def parse_detail(self,response):
        # Receive the meta passed along with the request
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item

Spider file / full-site crawling + deep crawling: movie.py

# Deep crawling + full-site crawling
import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/class/动作/id/1.html']
    url = 'https://www.4567tv.tv/index.php/vod/show/class/动作/id/1/page/%d.html'
    pageNum = 1

    def parse(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('.//div[@class="stui-vodlist__detail"]/h4/a/text()').extract_first()
            detail_url = 'https://www.4567tv.tv'+li.xpath('.//div[@class="stui-vodlist__detail"]/h4/a/@href').extract_first()
            item = MovieproItem()
            item['title'] = title
            # print(title,detail_url)
            # Send a manual request to the detail page URL.
            # Request passing: meta is a dict that is handed to the callback.
            yield scrapy.Request(detail_url,callback=self.parse_detail,meta={'item':item})

        # Full-site crawling: request the next list page once per parsed page,
        # after the loop over the entries has finished
        if self.pageNum < 4:
            self.pageNum += 1
            new_url = self.url % self.pageNum
            yield scrapy.Request(new_url,callback=self.parse)

    # A second, custom parse method (it must take a response parameter)
    def parse_detail(self,response):
        # Receive the meta passed along with the request
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc
        yield item

pipelines.py

class MovieproPipeline(object):

    def process_item(self, item, spider):

        print(item)

        return item
