Scrapy Tips and Tricks 1

Passing values from a Request to its callback function

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

Pass cookies with a Request, and restrict the domain and path under which the cookies received afterwards are stored.

from scrapy import Request

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])

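Setting the domain and path attributes lets Scrapy's cookie handling store the cookies received from matching pages and attach them to later requests automatically, so you do not have to add them by hand. If a site requires a cookie before it lets you in, you only need to send it once, on the first request.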

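Setting a Request's dont_filter to True keeps the duplicate filter from dropping it, which is useful when you want to request the same page more than once. It defaults to False.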

Use a Request's errback to catch exceptions.

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:
        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

Use a Request's replace method to build a new Request: every value is copied from the original except the ones you redefine.

replace([url, method, headers, body, cookies, meta, flags, encoding, priority, dont_filter, callback, errback, cb_kwargs])

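Returns a Request object with the same members, except for those given new values through the specified keyword arguments. By default, the Request.cb_kwargs and Request.meta attributes are shallow-copied (unless new values are given as arguments).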

scrapy.FormRequest and `scrapy.FormRequest.from_response` https://blog.csdn.net/qq_33472765/article/details/80958820

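See the documentation for details. from_response differs from a plain FormRequest: the latter is a simple POST, while the former pre-fills the form fields found in the HTML page. You can specify which form to use (if the page has several), and it then simulates a click on the submit control. For forms driven by JavaScript, you can also tell it not to simulate the click.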

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]

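Websites commonly provide pre-populated form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (for login pages). When scraping, you will want these fields filled in automatically and override only a few of them, such as the username and password. You can use the FormRequest.from_response() method for this job.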

import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
