Items

Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
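As a quick illustration of that dictionary-like API, here is a minimal usage sketch (the field values below are made up for the example):

product = Product(name="Desktop PC", price=1000)
product["stock"] = 5              # set a declared field, just like a dict key
print(product["name"])            # -> 'Desktop PC'
print(product.get("stock"))       # -> 5
print(dict(product))              # convert to a plain dict when needed

# Assigning to a field that was not declared raises a KeyError:
# product["colour"] = "black"     # KeyError: 'Product does not support field: colour'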

Extending Items

You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values.

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

Item Objects

1. class scrapy.item.Item([arg])

Return a new Item, optionally initialized from the given argument.

The only additional attribute provided by Items is: fields

2. Field objects

class scrapy.item.Field([arg])

The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes.
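A small sketch of how the fields attribute and Field metadata can be inspected, assuming the Product item declared above:

# fields is a dict mapping each declared field name to its Field (metadata dict)
print(sorted(Product.fields.keys()))   # -> ['last_updated', 'name', 'price', 'stock']

# Since Field is just a dict, per-field metadata can be read with plain dict access:
print(Product.fields["last_updated"].get("serializer") is str)   # -> True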

_______________________________________________________________________________________________________________________________

Built-in spiders reference

Scrapy comes with some useful generic spiders that you can subclass your spiders from. Their aim is to provide convenient functionality for a few common scraping cases.

class scrapy.spider.Spider

This is the simplest spider, and the one from which every other spider must inherit.

Important attributes:

name

A string which defines the name for this spider. It must be unique. This is the most important spider attribute and it is required.

allowed_domains

An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.

start_urls

A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.

start_requests()

This is the method called by Scrapy when the spider is opened for scraping and no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests.

make_requests_from_url(url)

A method that receives a URL and returns a Request object to scrape. Unless overridden, this method returns Requests with the parse() method as their callback function.

parse(response)

The parse method is in charge of processing the response and returning scraped data.

log(message[, level, component])

Log a message.

closed(reason)

Called when the spider closes.
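To tie these attributes and methods together, here is a minimal Spider sketch. The name, domain, URLs and XPath are made up for the example, and it assumes a Scrapy version where response.xpath() is available:

from scrapy.spider import Spider

class ExampleSpider(Spider):
    name = "example"                          # required and must be unique
    allowed_domains = ["example.com"]         # offsite requests dropped if OffsiteMiddleware is enabled
    start_urls = ["http://www.example.com/"]  # where crawling starts when no URLs are given

    def parse(self, response):
        # default callback: process the response and return scraped data
        # (Product is the item declared earlier; in a real project it would
        # be imported from your items module)
        for title in response.xpath("//h1/text()").extract():
            item = Product()
            item["name"] = title
            yield item

    def closed(self, reason):
        self.log("spider closed: %s" % reason)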

class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.

In addition to the attributes inherited from Spider, CrawlSpider provides the following attribute.

rules

A list of one or more Rule objects. Each Rule defines a certain behaviour for crawling the site.

About Rule objects:

class scrapy.contrib.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

link_extractor is a Link Extractor object which defines how links will be extracted from each crawled page.

callback is a callable or a string to be called for each link extracted with the specified link_extractor.

Note: when writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic.

cb_kwargs is a dict containing the keyword arguments to be passed to the callback function.

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True (i.e. links from that page will continue to be crawled), otherwise it defaults to False.

process_request is a callable or a string which will be called with every request extracted by this rule, and must return a request or None.
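A minimal CrawlSpider sketch showing how rules, callbacks and follow fit together. It uses the SgmlLinkExtractor described below; the URL patterns and names are hypothetical, and the callback is deliberately not named parse:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = "example_crawl"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # follow category pages without parsing them (callback is None, so follow defaults to True)
        Rule(SgmlLinkExtractor(allow=(r"/category/",))),
        # parse item pages with parse_item; follow defaults to False here because a callback is set
        Rule(SgmlLinkExtractor(allow=(r"/item/\d+",)), callback="parse_item"),
    )

    def parse_item(self, response):
        # extract data here; parse itself is left untouched because CrawlSpider needs it
        self.log("visiting item page: %s" % response.url)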

------------------------------------------------------------------------------------------------------------------------------------

LinkExtractors are objects whose only purpose is to extract links from web pages (scrapy.http.Response objects).

Scrapy ships with two built-in Link Extractors, and you can write your own as needed.

All available link extractor classes bundled with Scrapy are provided in the scrapy.contrib.linkextractors module.

SgmlLinkExtractor

class scrapy.contrib.linkextractors.sgml.SgmlLinkExtractor(allow, ...)

The SgmlLinkExtractor extends the base BaseSgmlLinkExtractor by providing additional filters that you can specify to extract links.

allow (a regular expression, or a list of regular expressions): a single regular expression (or list of regular expressions) that the URLs must match in order to be extracted. If not given, it will match all links.
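A short sketch of using the allow filter on its own (the /product/ pattern is hypothetical):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# allow takes a single regex or a list of regexes; this one keeps only
# links whose URL contains /product/ followed by digits
extractor = SgmlLinkExtractor(allow=(r"/product/\d+",))

# inside a spider callback, links can be pulled out of a response like this:
# links = extractor.extract_links(response)   # -> list of scrapy.link.Link objects
# for link in links:
#     self.log(link.url)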
