
1. References

https://github.com/scrapy-plugins/scrapy-splash#configuration

Treat that page as the authoritative reference.

Scrapy notes: installing Splash, a JavaScript rendering service.

  1. Start the Docker Quickstart Terminal.
  2. Use putty to connect to the IP shown below, port 22, username/password: docker/tcuser.
  3. Start the service:

       sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash

  4. Open http://192.168.99.100:8050/ in a browser.

The Quickstart Terminal banner looks like this:

  docker is configured to use the default machine with IP 192.168.99.100
  For help getting started, check out the docs at https://docs.docker.com

  Start interactive shell

  win7@win7-PC MINGW64 ~
  $
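
Once the container is running, it is worth checking that the Splash HTTP API answers before wiring it into Scrapy. A minimal sanity check, assuming the default Docker machine IP 192.168.99.100 and the requests library (the target URL here is just an example):

  # quick sanity check: ask Splash to render a page and return its HTML
  import requests

  resp = requests.get(
      'http://192.168.99.100:8050/render.html',
      params={'url': 'https://www.cnblogs.com/', 'wait': 0.5},
  )
  print(resp.status_code)   # expect 200 if Splash is up
  print(resp.text[:200])    # beginning of the rendered HTML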

2. Practice

2.1 After creating a new project, edit settings.py

Set ROBOTSTXT_OBEY to False, and add the following:

  '''https://github.com/scrapy-plugins/scrapy-splash#configuration'''
  # 1. Add the Splash server address to settings.py of your Scrapy project like this:
  SPLASH_URL = 'http://192.168.99.100:8050'

  # 2. Enable the Splash middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file
  # and changing HttpCompressionMiddleware priority:
  DOWNLOADER_MIDDLEWARES = {
      'scrapy_splash.SplashCookiesMiddleware': 723,
      'scrapy_splash.SplashMiddleware': 725,
      'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
  }
  # Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
  # HttpCompressionMiddleware priority should be changed in order to allow advanced response processing;
  # see https://github.com/scrapy/scrapy/issues/1895 for details.

  # 3. Enable SplashDeduplicateArgsMiddleware by adding it to SPIDER_MIDDLEWARES in your settings.py:
  SPIDER_MIDDLEWARES = {
      'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
  }
  # This middleware is needed to support the cache_args feature;
  # it saves disk space by not storing duplicate Splash arguments multiple times in the disk request queue.
  # If Splash 2.1+ is used, the middleware also saves network traffic by not sending these duplicate arguments to the Splash server multiple times.

  # 4. Set a custom DUPEFILTER_CLASS:
  DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

  # 5. If you use the Scrapy HTTP cache, then a custom cache storage backend is required.
  # scrapy-splash provides a subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:
  HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
  # If you use another cache storage, it is necessary to subclass it
  # and replace all scrapy.utils.request.request_fingerprint calls with scrapy_splash.splash_request_fingerprint.

  # Note
  # Steps (4) and (5) are necessary because Scrapy doesn't provide a way to override the request fingerprint calculation algorithm globally; this could change in the future.
  # There are also some additional options available. Put them into your settings.py if you want to change the defaults:
  # SPLASH_COOKIES_DEBUG is False by default. Set it to True to enable debugging cookies in the SplashCookiesMiddleware. This option is similar to COOKIES_DEBUG for the built-in scrapy cookies middleware: it logs sent and received cookies for all requests.
  # SPLASH_LOG_400 is True by default - it instructs scrapy-splash to log all 400 errors from Splash. They are important because they show errors that occurred when executing the Splash script. Set it to False to disable this logging.
  # SPLASH_SLOT_POLICY is scrapy_splash.SlotPolicy.PER_DOMAIN by default. It specifies how concurrency & politeness are maintained for Splash requests, and specifies the default value of the slot_policy argument of SplashRequest.
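
As a side note on step 3: the cache_args feature is opt-in per request. A minimal sketch of how it is typically used (the Lua script below is a hypothetical example, not from the original post):

  from scrapy_splash import SplashRequest

  LUA_SOURCE = """
  function main(splash)
      splash:go(splash.args.url)
      splash:wait(0.5)
      return splash:html()
  end
  """

  # inside a spider:
  def start_requests(self):
      for url in self.start_urls:
          yield SplashRequest(
              url, self.parse,
              endpoint='execute',
              args={'lua_source': LUA_SOURCE},
              # arguments named here are stored once and referenced by hash
              # on later requests (needs SplashDeduplicateArgsMiddleware)
              cache_args=['lua_source'],
          )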

2.2 Writing a basic spider

  # -*- coding: utf-8 -*-
  import scrapy
  from scrapy_splash import SplashRequest
  from scrapy.shell import inspect_response
  import base64
  from PIL import Image
  from io import BytesIO

  class CnblogsSpider(scrapy.Spider):
      name = 'cnblogs'
      allowed_domains = ['cnblogs.com']
      start_urls = ['https://www.cnblogs.com/']

      def start_requests(self):
          for url in self.start_urls:
              yield SplashRequest(url, self.parse, args={'wait': 0.5})

      def parse(self, response):
          inspect_response(response, self)  ########################

When debugging, view(response) opens the page as a .txt file... Save it as .html instead and open it in a browser.
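
A small workaround sketch (the file name splash_debug.html is arbitrary): dump the rendered body to an .html file and open it with Python's standard webbrowser module:

  import webbrowser

  def parse(self, response):
      filename = 'splash_debug.html'   # arbitrary name
      with open(filename, 'wb') as f:
          f.write(response.body)       # the rendered HTML from Splash
      webbrowser.open(filename)        # open it in the default browser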

2.3 Writing a screenshot spider

See also https://stackoverflow.com/questions/45172260/scrapy-splash-screenshots

  def start_requests(self):
      splash_args = {
          'html': 1,
          'png': 1,
          #'width': 1024,      # default viewport is 1024x768 (4:3)
          #'render_all': 1,    # full-page screenshot; without it only the first
                               # screen is captured. Requires wait as well,
                               # otherwise Splash returns an error
          #'wait': 0.5,
      }

      for url in self.start_urls:
          yield SplashRequest(url, self.parse, endpoint='render.json', args=splash_args)

http://splash.readthedocs.io/en/latest/api.html?highlight=wait#render-png

render_all=1 requires non-zero wait parameter. This is an unfortunate restriction, but it seems that this is the only way to make rendering work reliably with render_all=1.
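
So for a full-page screenshot the two arguments must be passed together; a minimal variant of the args dict above, against the same render.json endpoint:

  # inside start_requests(): render_all=1 paired with a non-zero wait
  splash_args = {
      'png': 1,
      'render_all': 1,   # capture the full page, not just the first viewport
      'wait': 0.5,       # required when render_all=1 (see the docs quote above)
  }
  yield SplashRequest(url, self.parse, endpoint='render.json', args=splash_args)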

https://github.com/scrapy-plugins/scrapy-splash#responses

Responses

scrapy-splash returns Response subclasses for Splash requests:

  • SplashResponse is returned for binary Splash responses - e.g. for /render.png responses;
  • SplashTextResponse is returned when the result is text - e.g. for /render.html responses;
  • SplashJsonResponse is returned when the result is a JSON object - e.g. for /render.json responses or /execute responses when script returns a Lua table.

SplashJsonResponse provides extra features:

  • response.data attribute contains response data decoded from JSON; you can access it like response.data['html'].

Showing the image and saving it to a file:

  def parse(self, response):
      # In [6]: response.data.keys()
      # Out[6]: [u'title', u'url', u'geometry', u'html', u'png', u'requestedUrl']

      imgdata = base64.b64decode(response.data['png'])
      img = Image.open(BytesIO(imgdata))
      img.show()
      filename = 'some_image.png'
      with open(filename, 'wb') as f:
          f.write(imgdata)
      inspect_response(response, self)  ########################

