Persistence in the Scrapy Framework

Terminal-command-based persistent storage
Pipeline-based persistent storage

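1 Terminal-command-based persistent storage

The spider's parse method must return an iterable object (usually a list or a dict). That return value can then be written to a file in a chosen format straight from the terminal.

Run the crawl with an output option to store the scraped data; the file extension selects the format:

    scrapy crawl <spider_name> -o xxx.json

    scrapy crawl <spider_name> -o xxx.xml

    scrapy crawl <spider_name> -o xxx.csv

Take scraping Qiushibaike (https://www.qiushibaike.com/text/) as an example: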

import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'  # the name of this spider, used by `scrapy crawl`
    # allowed_domains takes domain names only, not URLs or paths
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # Parsing callback
    def parse(self, response):  # response is the response object for the request sent to the start URL
        author_list = response.xpath('//div[@id="content-left"]/div')
        all_data = []
        for div in author_list:
            # extract_first() extracts the first element of the list returned by xpath()
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            # extract() would instead return every match as a list of strings:
            # author = div.xpath('./div/a[2]/h2/text()').extract()
            content = div.xpath('./a/div/span/text()').extract_first()
            data = {
                'author': author,
                'content': content
            }
            all_data.append(data)
        return all_data  # an iterable object

Then, in the terminal, run the crawl with the output option for the format you want:

    scrapy crawl <spider_name> -o xxx.json

    scrapy crawl <spider_name> -o xxx.xml

    scrapy crawl <spider_name> -o xxx.csv
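To make the output formats concrete, here is a minimal sketch of roughly what ends up in the .json and .csv files for a list of dicts like the one parse() returns. It uses only the standard library and is not Scrapy's own exporter code; the sample rows and the helper names to_json and to_csv are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample of what parse() might return; the values are made up.
all_data = [
    {"author": "user_a", "content": "first joke"},
    {"author": "user_b", "content": "second joke"},
]


def to_json(rows):
    """Serialize the rows as a JSON array, keeping non-ASCII text readable."""
    return json.dumps(rows, ensure_ascii=False)


def to_csv(rows):
    """Serialize the rows as CSV with a header taken from the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(all_data)
csv_text = to_csv(all_data)
print(json_text)
print(csv_text)
```

Each dict becomes one JSON object or one CSV row, which is why parse() has to return items with a consistent set of keys.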

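2 Pipeline-based persistent storage

Scrapy ships with an efficient, convenient persistence mechanism that can be used directly. It relies on two files:

    items.py: the data structure template file. It defines the data fields.

    pipelines.py: the pipeline file. It receives the data (items) and persists it.

Persistence flow:

    1. After the spider file scrapes the data, it packs the data into an item object.

    2. The yield keyword submits the item object to the pipeline for persistence.

    3. The process_item method in the pipeline file receives the item object submitted by the spider, and the persistence code written there stores the data held in the item.

    4. The pipeline is enabled in the settings.py configuration file.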
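The flow above can be modelled in a few lines of plain Python. This is a simplified sketch, not Scrapy's engine: DemoPipeline, parse, and run are hypothetical names, and the items are ordinary dicts rather than scrapy.Item objects. It only shows the order of the calls: open_spider once, process_item once per yielded item, close_spider once.

```python
class DemoPipeline:
    """Hypothetical pipeline that collects items in memory instead of storing them."""

    def open_spider(self, spider):
        self.stored = []
        self.closed = False

    def process_item(self, item, spider):
        self.stored.append(dict(item))
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        self.closed = True


def parse(rows):
    """Stand-in for a spider's parse(): yields one item per scraped row."""
    for author, content in rows:
        yield {"author": author, "content": content}


def run(rows, pipelines):
    """Simplified driver: open once, process every item, close once."""
    spider = object()  # placeholder; a real crawl passes the Spider instance
    for p in pipelines:
        p.open_spider(spider)
    for item in parse(rows):
        for p in pipelines:
            item = p.process_item(item, spider)
    for p in pipelines:
        p.close_spider(spider)


pipeline = DemoPipeline()
run([("user_a", "first joke"), ("user_b", "second joke")], [pipeline])
print(pipeline.stored)
```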

1 The spider file qiubai.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import FirstProjectItem

'''
Pipeline-based storage involves four pieces:
 - the spider file, which parses the data
 - items.py: every parsed value is packed into an item object
 - pipelines.py
 - the settings.py configuration file
'''

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains takes domain names only, not URLs or paths
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        author_list = response.xpath('//div[@id="content-left"]/div')
        for div in author_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            # author = div.xpath('./div/a[2]/h2/text()')[0].extract()
            content = div.xpath('./a/div/span/text()').extract_first()

            # ---------------- key part ----------------
            item = FirstProjectItem()
            item["author"] = author
            item["content"] = content
            # Submit the item to the pipeline
            yield item
            # ------------------------------------------

2 items.py

import scrapy

# The Item class is instantiated in the spider; each instance stores one set of parsed values
class FirstProjectItem(scrapy.Item):
    # define the fields for your item here like:
    # Key point: declare one Field for every value that step 1 persists
    author = scrapy.Field()
    content = scrapy.Field()

3 pipelines.py

# process_item runs once for every item the spider submits to the pipeline
class FirstProjectPipeline(object):
    # item is the object submitted by the spider file
    def process_item(self, item, spider):
        return item

4 settings.py

# Around line 67 of the generated settings.py
ITEM_PIPELINES = {
   'first_project.pipelines.FirstProjectPipeline': 300,
}

With these four steps in place, the basic pipeline-based persistence workflow is complete. The pipeline so far only passes items through, though, so pipelines.py still needs real storage logic.

Only step 3, pipelines.py, changes:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class FirstProjectPipeline(object):
    # Opening the file inside process_item would reopen it for every item,
    # so override open_spider to open it exactly once.
    fp = None

    def open_spider(self, spider):
        print('spider started')
        self.fp = open('qiubai1.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Writes one line per item to qiubai1.txt
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.fp.close()
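A pipeline like the one above can be checked without running a crawl by calling its three hooks by hand. The sketch below assumes a variant named TxtPipeline whose file path is passed in (an addition made here so the test can write to a temporary directory), and it feeds in plain dicts with made-up values instead of real scraped items.

```python
import os
import tempfile


class TxtPipeline:
    """Same open/process/close structure as the file pipeline; the path argument is an assumption for testing."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Opened once per crawl, not once per item
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item["author"] + ":" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()


path = os.path.join(tempfile.mkdtemp(), "qiubai1.txt")
pipeline = TxtPipeline(path)

# Call the hooks in the same order a crawl would
pipeline.open_spider(spider=None)
for item in [
    {"author": "user_a", "content": "first joke"},
    {"author": "user_b", "content": "second joke"},
]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fp:
    lines = fp.read().splitlines()
print(lines)
```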

3 Writing to a database (MySQL)

import pymysql

class MysqlPipline(object):
    cursor = None
    conn = None

    def open_spider(self, spider):
        print('mysql started')
        self.conn = pymysql.connect(host='127.0.0.1', user='root', password='123456', port=3306, db='s18', charset='utf8')
        # Create the cursor once so close_spider can always close it
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Let the driver substitute the values; building the SQL with % breaks on quotes in the text
        sql = "insert into t_qiubai VALUES (%s, %s)"
        try:
            self.cursor.execute(sql, (item["author"], item["content"]))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('mysql finished')
        self.cursor.close()
        self.conn.close()
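Scraped text often contains quote characters, which is where building the insert statement with string formatting goes wrong. The sketch below illustrates the difference using the standard library's sqlite3 module as a stand-in for MySQL, so it runs without a server. Note the swap: sqlite3 uses ? as its placeholder, while pymysql uses %s passed as a separate argument to execute. The sample item is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t_qiubai (author text, content text)")

# Hypothetical item whose values contain single quotes
item = {"author": "O'Brien", "content": "it's a 'quoted' joke"}

# String formatting produces broken SQL here because of the quotes in the values.
unsafe_sql = "insert into t_qiubai values ('%s','%s')" % (item["author"], item["content"])
try:
    conn.execute(unsafe_sql)
    formatted_ok = True
except sqlite3.Error:
    formatted_ok = False

# Placeholders leave the quoting to the driver.
conn.execute("insert into t_qiubai values (?, ?)", (item["author"], item["content"]))
conn.commit()

rows = conn.execute("select author, content from t_qiubai").fetchall()
conn.close()
print(formatted_ok, rows)
```

The same idea carries over to the MySQL pipeline: keep the SQL text constant and pass the item values as parameters.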

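settings.py

ITEM_PIPELINES = {
   'first_project.pipelines.FirstProjectPipeline': 300,
   # Priority value: the lower the value, the earlier the pipeline runs
   # (400 is an assumed example; any integer from 0 to 1000 works)
   'first_project.pipelines.MysqlPipline': 400,
}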
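The numbers in ITEM_PIPELINES decide the order in which each item passes through the pipelines, lowest first. A small sketch of that ordering rule, using assumed priority values and a hypothetical helper named execution_order:

```python
# Assumed priorities for illustration; the dict order deliberately differs from the run order.
ITEM_PIPELINES = {
    "first_project.pipelines.MysqlPipline": 400,
    "first_project.pipelines.FirstProjectPipeline": 300,
}


def execution_order(pipelines):
    """Return the pipeline paths sorted by priority value, lowest (earliest) first."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = execution_order(ITEM_PIPELINES)
print(order)
```

Because every process_item returns the item, the next pipeline in this order receives it; a pipeline that forgets to return the item passes None onward.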

4 Writing to a Redis database

Install Redis on Windows first.

import json

import redis

class RedisPipline(object):
    r = None

    def open_spider(self, spider):
        print('redis started')
        self.r = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        data = {
            'author': item['author'],
            'content': item['content']
        }
        # Redis list values must be bytes, strings or numbers, so serialize the dict first
        self.r.lpush('data', json.dumps(data, ensure_ascii=False))
        return item

    def close_spider(self, spider):
        print('redis finished')
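Redis list elements are strings, so a dict has to be serialized before it is pushed; recent redis-py versions reject a raw dict. The sketch below shows the JSON round trip using FakeRedis, a hypothetical in-memory stand-in written for this example (it is not the redis-py client and only mimics lpush and a simplified lrange), so it runs without a Redis server. The sample items are made up.

```python
import json


class FakeRedis:
    """In-memory stand-in for two list commands; not the redis-py client."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # Mimic the rule that only bytes, strings and numbers can be stored
        if not isinstance(value, (bytes, str, int, float)):
            raise TypeError("value must be bytes, str, int or float")
        self.lists.setdefault(key, []).insert(0, value)

    def lrange(self, key, start, stop):
        # Simplified: only handles a non-negative start and stop == -1 or a non-negative stop
        values = self.lists.get(key, [])
        end = len(values) if stop == -1 else stop + 1
        return values[start:end]


r = FakeRedis()
items = [
    {"author": "user_a", "content": "first joke"},
    {"author": "user_b", "content": "second joke"},
]

try:
    r.lpush("data", items[0])  # a raw dict is rejected
    dict_accepted = True
except TypeError:
    dict_accepted = False

for item in items:
    r.lpush("data", json.dumps(item, ensure_ascii=False))

# lpush adds to the head, so the most recent item comes back first
stored = [json.loads(value) for value in r.lrange("data", 0, -1)]
print(stored)
```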

settings.py configuration

ITEM_PIPELINES = {

   'first_project.pipelines.FirstProjectPipeline': 300,

   'first_project.pipelines.RedisPipline': 500,

}

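The stored data can then be inspected from the Redis client:

keys *   # list all keys

lrange data 0 -1  # show every element of the list stored under the key data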
