Scrapy notes: Splash in practice
1. Reference
https://github.com/scrapy-plugins/scrapy-splash#configuration
Treat this README as the authoritative guide.
Splash installation (see the earlier post "scrapy相关:splash安装" - Splash is a javascript rendering service):
- Start the Docker Quickstart Terminal
- Connect with putty to the IP shown below, port 22, username/password: docker/tcuser
- Start the service:
- sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
- Open in a browser: http://192.168.99.100:8050/
Docker Quickstart Terminal output:
docker is configured to use the default machine with IP 192.168.99.100
For help getting started, check out the docs at https://docs.docker.com
Start interactive shell
win7@win7-PC MINGW64 ~
$
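Before wiring Splash into Scrapy, it is worth checking that the container is reachable. A minimal sketch using the requests library (the IP is the Docker machine IP above; the target URL is just an example):

import requests

# hit Splash's render.html endpoint directly; HTTP 200 plus rendered
# markup in the body means the service is up
resp = requests.get('http://192.168.99.100:8050/render.html',
                    params={'url': 'https://www.cnblogs.com/', 'wait': 0.5})
print(resp.status_code, len(resp.text))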
2. Practice
2.1 After creating a new project, edit settings.py
Change ROBOTSTXT_OBEY to False, then add the content below from the scrapy-splash README:
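That is, in settings.py:

ROBOTSTXT_OBEY = False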
'''https://github.com/scrapy-plugins/scrapy-splash#configuration'''

# 1. Add the Splash server address to settings.py of your Scrapy project like this:
SPLASH_URL = 'http://192.168.99.100:8050'

# 2. Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file
#    and changing HttpCompressionMiddleware priority:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
# HttpCompressionMiddleware priority should be changed in order to allow advanced response
# processing; see https://github.com/scrapy/scrapy/issues/1895 for details.

# 3. Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# This middleware is needed to support the cache_args feature; it saves disk space by not storing
# duplicate Splash arguments multiple times in a disk request queue. If Splash 2.1+ is used, the
# middleware also saves network traffic by not sending these duplicate arguments to the Splash
# server multiple times.

# 4. Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# 5. If you use the Scrapy HTTP cache then a custom cache storage backend is required.
#    scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# If you use another cache storage then it is necessary to subclass it and replace all
# scrapy.util.request.request_fingerprint calls with scrapy_splash.splash_request_fingerprint.

# Note: steps (4) and (5) are necessary because Scrapy doesn't provide a way to override the
# request fingerprint calculation algorithm globally; this could change in the future.
# There are also some additional options available. Put them into your settings.py if you want
# to change the defaults:
# SPLASH_COOKIES_DEBUG is False by default. Set it to True to enable debugging of cookies in the
#   SplashCookiesMiddleware. This option is similar to COOKIES_DEBUG for the built-in scrapy
#   cookies middleware: it logs sent and received cookies for all requests.
# SPLASH_LOG_400 is True by default - it instructs scrapy-splash to log all 400 errors from
#   Splash. These are important because they show errors that occurred while executing the
#   Splash script. Set it to False to disable this logging.
# SPLASH_SLOT_POLICY is scrapy_splash.SlotPolicy.PER_DOMAIN by default. It specifies how
#   concurrency & politeness are maintained for Splash requests, and sets the default value of
#   the slot_policy argument of SplashRequest (see the scrapy-splash README).
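As an illustration, a sketch of setting these options explicitly in settings.py (the values shown are just the documented defaults made explicit):

import scrapy_splash

SPLASH_COOKIES_DEBUG = False  # True logs cookies sent/received by SplashCookiesMiddleware
SPLASH_LOG_400 = True         # False silences logging of Splash 400 errors
SPLASH_SLOT_POLICY = scrapy_splash.SlotPolicy.PER_DOMAIN  # how request slots are assigned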
2.2 Writing a basic spider
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy.shell import inspect_response
import base64
from PIL import Image
from io import BytesIO


class CnblogsSpider(scrapy.Spider):
    name = 'cnblogs'
    allowed_domains = ['cnblogs.com']
    start_urls = ['https://www.cnblogs.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        inspect_response(response, self)  ########################
Debugging tip: view(response) opens the body as a .txt file; save it as .html and open it in a browser instead.
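As an alternative, a minimal sketch that writes the rendered markup to an .html file from inside parse() (the filename is arbitrary; io.open keeps it Python 2/3 compatible):

import io

def parse(self, response):
    # save the rendered page with an .html extension so a browser applies the markup
    with io.open('rendered.html', 'w', encoding='utf-8') as f:
        f.write(response.text)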
2.3 Writing a screenshot spider
See also https://stackoverflow.com/questions/45172260/scrapy-splash-screenshots
    def start_requests(self):
        splash_args = {
            'html': 1,
            'png': 1,
            # 'width': 1024,     # default viewport is 1024x768 (4:3)
            # 'render_all': 1,   # full-page screenshot; without it only the first screen is
            #                    # captured; must be combined with wait, otherwise Splash errors out
            # 'wait': 0.5,
        }
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.json', args=splash_args)
From http://splash.readthedocs.io/en/latest/api.html?highlight=wait#render-png :
render_all=1 requires a non-zero wait parameter. This is an unfortunate restriction, but it seems that this is the only way to make rendering work reliably with render_all=1.
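So a full-page screenshot request would pair the two options, as in this sketch (values illustrative, uncommenting the options from the spider above):

splash_args = {'html': 1, 'png': 1, 'render_all': 1, 'wait': 0.5}
yield SplashRequest(url, self.parse, endpoint='render.json', args=splash_args)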
https://github.com/scrapy-plugins/scrapy-splash#responses
Responses
scrapy-splash returns Response subclasses for Splash requests:
- SplashResponse is returned for binary Splash responses - e.g. for /render.png responses;
- SplashTextResponse is returned when the result is text - e.g. for /render.html responses;
- SplashJsonResponse is returned when the result is a JSON object - e.g. for /render.json responses or /execute responses when script returns a Lua table.
SplashJsonResponse provides extra features: the response.data attribute contains the response data decoded from JSON; you can access it like response.data['html'].
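For illustration, a sketch that branches on these subclasses inside a callback (assuming the classes are importable from the scrapy_splash package, as its README suggests):

from scrapy_splash import SplashJsonResponse, SplashTextResponse

def parse(self, response):
    if isinstance(response, SplashJsonResponse):
        html = response.data['html']  # field decoded from the render.json payload
    elif isinstance(response, SplashTextResponse):
        html = response.text          # render.html returns the markup as text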
Displaying the image and saving it to a file:
    def parse(self, response):
        # In [6]: response.data.keys()
        # Out[6]: [u'title', u'url', u'geometry', u'html', u'png', u'requestedUrl']
        imgdata = base64.b64decode(response.data['png'])
        img = Image.open(BytesIO(imgdata))
        img.show()
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
        inspect_response(response, self)  ########################
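Since the request also asked for html=1, the same response carries the rendered markup; a sketch of saving it next to the screenshot (filename arbitrary):

import io

html = response.data['html']
with io.open('some_page.html', 'w', encoding='utf-8') as f:
    f.write(html)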