笔记-scrapy-pipeline

木林森__𣛧 2024-09-02 20:10:23 原文

笔记-scrapy-pipeline

1.简介

scrapy抓取数据后，使用yield发送item对象至pipeline，pipeline顺序对item进行处理。

一般用于：

　　清洗，验证，检查数据；

　　存储数据；

2.使用

将数据保存到json文件中示例

import json

class JsonWriterPipeline(object):

　　def open_spider(self, spider):
　　　　self.file = open('items.jl', 'w')

　　def close_spider(self, spider):
　　　　self.file.close()

　　def process_item(self, item, spider):
　　　　line = json.dumps(dict(item)) + "\n"
　　　　self.file.write(line)
　　　　return item

3.类及方法介绍

process_item(self, item, spider)

This method is called for every item pipeline component. process_item() must either: return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred or raise DropItem exception. Dropped items are no longer processed by further pipeline components.

Parameters:
item (Item object or a dict) – the item scraped
spider (Spider object) – the spider which scraped the item

偶尔也会使用以下方法：

open_spider(self, spider)

This method is called when the spider is opened.

Parameters: spider (Spider object) – the spider which was opened

close_spider(self, spider)

This method is called when the spider is closed.

Parameters: spider (Spider object) – the spider which was closed

from_crawler(cls, crawler)

If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. Crawler object provides access to all Scrapy core components like settings and signals; it is a way for pipeline to access them and hook its functionality into Scrapy.

Parameters: crawler (Crawler object) – crawler that uses this pipeline

4.更多用法

激活pipeline

如果想要使用pipeline，需要在settings文件中设置如下：

ITEM_PIPELINES = {
　　'myproject.pipelines.PricePipeline' ： 300 ，
　　'myproject.pipelines.JsonWriterPipeline' ： 800 ，
}

数值决定运行顺序，越小越优先，设置范围为0-1000。

笔记-scrapy-pipeline的更多相关文章

Scrapy笔记06- Item Pipeline
Scrapy笔记06- Item Pipeline 当一个item被蜘蛛爬取到之后会被发送给Item Pipeline,然后多个组件按照顺序处理这个item. 每个Item Pipeline组件其实就 ...
笔记-scrapy与twisted
笔记-scrapy与twisted Scrapy使用了Twisted作为框架,Twisted有些特殊的地方是它是事件驱动的,并且比较适合异步的代码. 在任何情况下,都不要写阻塞的代码.阻塞的代码包括: ...
scrapy pipeline
pipeline的四个方法 @classmethod def from_crawler(cls, crawler): """ 初始化的时候,用以创建pipeline对象 ...
redis学习笔记之pipeline
redis是一个cs模式的tcp server,使用和http类似的请求响应协议.一个client可以通过一个socket连接发起多个请求命令.每个请求命令发出后client通常会阻塞并等待redi ...
scrapy Pipeline使用twisted异步实现mysql数据插入
from twisted.enterprise import adbapi class MySQLAsyncPipeline: def open_spider(self, spider): db = ...
scrapy Pipeline 练习
class WeatherPipeline(object): def process_item(self, item, spider): print(item) return item #插入到red ...
Scrapy 初体验
开发笔记 Scrapy 初体验 scrapy startproject project_name 创建工程 scrapy genspider -t basic spider_name website. ...
Python Scrapy环境配置教程+使用Scrapy爬取李毅吧内容
Python爬虫框架Scrapy Scrapy框架 1.Scrapy框架安装直接通过这里安装scrapy会提示报错: error: Microsoft Visual C++ 14.0 is requ ...
scrapy项目5：爬取ajax形式加载的数据，并用ImagePipeline保存图片
1.目标分析: 我们想要获取的数据为如下图: 1).每本书的名称 2).每本书的价格 3).每本书的简介 2.网页分析: 网站url:http://e.dangdang.com/list-WY1-dd ...
Scrapy 下载文件和图片
我们学习了从网页中爬取信息的方法,这只是爬虫最典型的一种应用,除此之外,下载文件也是实际应用中很常见的一种需求,例如使用爬虫爬取网站中的图片.视频.WORD文档.PDF文件.压缩包等. 1.Files ...

随机推荐

oracle基本查询练习
select * from regions; select * from countries; select * from locations; select * from departments; ...
SharePoint 栏的三种名字Filed ：StaticName、 InternalName、 DisplayName
SharePoint 的栏,有3个名字, StaticName InternalName DisplayName. 当在第一次创建栏的时候,这3个名字一起进行创建,并且都一样. <FIELD ...
http：origin,referer和host区别
发起一个ajax请求时,request header里面有三个属性会涉及请求源信息.前端可能用不到这些值,但是,后台业务系统会比较关心它们,场景可能有: 处理跨域请求时,必须判断来源请求方是否合法:后 ...
NFS笔记（二）NFS服务器配置实例
一.NFS服务器配置实例实验拓扑二.实验要求及环境 2.1实验环境 NFS服务器 IP:192.168.8.5环境:[root@server7 ~]# uname -aLinux server7.c ...
UE4工具
COMMON CONTAINERS TARRAY (Engine\Source\Runtime\Core\Public\Containers\Array.h) TSET (Engine\Source\ ...
在ABAP里取得一个数据库表记录数的两种方法
方法1:使用函数EM_GET_NUMBER_OF_ENTRIES 这个函数使用起来很简单,只需要将想查询的数据库表名称维护进输入参数IT_TABLES: 上图说明这个函数支持批量操作,我查询的两张表名 ...
python IDE--pycharm安装及使用
官网 :http://www.jetbrains.com/pycharm/ 下载community版本,免费.下载之后傻瓜式安装即可. 1 启动pycharm,选择新建项目: 设置项目路径和项目名: ...
openwrt定制管理
版权声明:本文为博主原创文章,未经博主同意不得转载. https://blog.csdn.net/qianguozheng/article/details/24673097 近期这个比較火,可是改了东 ...
Linux空间PHP开发环境小白教程（LAMP）
租了一个云服务器, 但是只有linux系统,没有php开发环境, 只好自己摸索着一步一步安装啦. 本教程来自自学IT创E老师的Linux教程,想详细了解的可以去论坛找. 一.使用PUTTY登录服务器 ...
css3阴影 box-shadow
语法 box-shadow:X轴偏移量 y轴偏移量 [阴影模糊半径] [阴影扩展半径] [阴影颜色] [投影方式] 参数介绍: 注:inset 可以写在参数的第一个或最后一个,其它位置是无效的. 阴影 ...