Scrapy shell使用

注意：容易出现403错误，实际爬取时不会出现。

response - a Response object containing the last fetched page

>>>response.xpath('//title/text()').extract()

return a list of selectors

>>>for index, link in enumerate(links):

... args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract()) ... print 'Link number %d points to url %s and image %s' % args

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg'] Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg'] Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg'] Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg'] Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']

enumerate() 函数一般用在 for 循环当中。

>>> seq = ['one', 'two', 'three'] >>> for element in seq: ... print i, seq[i] ... i +=1 ... 0 one 1 two 2 three

one 1 two 2 three

suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:

>>> divs = response.xpath('//div')

note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'): # extracts all <p> inside ... print p.extract()

Another common case would be to extract all direct <p> children:

>>> for p in divs.xpath('p'): ... print p.extract()

在程序中使用shell

from scrapy.shell import inspect_response inspect_response(response, self)

Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:

xpath最外层最好用单引号！

shell 本地html，方便调试（但别取名为index.html）

scrapy shell ./path/to/file.html ,即使在本目录，也必须要加./，不能直接 shell file.html scrapy shell ../other/path/to/file.html scrapy shell /absolute/path/to/file.html

Scrapy shell使用的更多相关文章

Scrapy shell调试网页的信息
通过scrapy shell "http://www.thinkive.cn:10000/zentaopms/www/index.php?m=user&f=login"
scrapy shell 中文网站输出报错.记录.
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 381-382: illegal multibyte sequence 上 ...
安装ipython，使用scrapy shell来验证xpath选择的结果 | How to install iPython and how does it work with Scrapy Shell
1. scrapy shell 是scrapy包的一个很好的交互性工具,目前我使用它主要用于验证xpath选择的结果.安装好了scrapy之后,就能够直接在cmd上操作scrapy shell了. 具 ...
python爬虫scrapy之scrapy终端(Scrapy shell)
Scrapy终端是一个交互终端,供您在未启动spider的情况下尝试及调试您的爬取代码. 其本意是用来测试提取数据的代码,不过您可以将其作为正常的Python终端,在上面测试任何的Python代码. ...
Scrapy Shell的使用
Scrapy终端是一个交互终端,我们可以在未启动spider的情况下尝试及调试代码,也可以用来测试XPath或CSS表达式,查看他们的工作方式,方便我们爬取的网页中提取的数据. 如果安装了 IPyth ...
14.Scrapy Shell
Scrapy终端是一个交互终端,我们可以在未启动spider的情况下尝试及调试代码,也可以用来测试XPath或CSS表达式,查看他们的工作方式,方便我们爬取的网页中提取的数据. 如果安装了 IPyth ...
scrapy shell的作用
1.可以方便我们做一些数据提取的测试代码: 2.如果想要执行scrapy命令,那么毫无疑问,肯定是要先进入到scrapy所在的环境中: 3.如果想要读取某个项目的配置信息,那么应该先进入到这个项目中. ...
Scrapy shell调试返回403错误
一.问题描述有时候用scrapy shell来调试很方便,但是有些网站有防爬虫机制,所以使用scrapy shell会返回403,比如下面 C:\Users\fendo>scrapy shel ...
scrapy shell
一.scrapy shell 1.安装pip install Jupyter 2.在pycharm中的启动命令: scrapy shell 注:启动后关键字高亮显示 3.查看response 执行sc ...
在Scrapy项目【内外】使用scrapy shell命令抓取某网站首页的初步情况
Windows 10家庭中文版,Python 3.6.3,Scrapy 1.5.0, 时隔一月,再次玩Scrapy项目,希望这次可以玩的更进一步. 本文展示使用在 Scrapy项目内.项目外scrap ...

随机推荐

Centos 7 搭建蓝鲸3.1.5社区版
第一次搭建蓝鲸平台,参考了蓝鲸社区的官方搭建文档. 友情链接:蓝鲸智云社区版V3.1用户手册搭建时遇到了不少的坑,这里做一个详细的安装梳理主机硬件要求官方的推荐如下: 我在公司测试环境搭建时机器 ...
GitHub页面布局乱了，怎么解决？？
GitHub页面布局乱了,怎么解决?? GitHub乱了,怎么解决?? 一年一度的布局又乱了!!! F12一下下面有东西加载不了,,,, Refused to evaluate a string as ...
[Java - 调用WebService]{http://schemas.microsoft.com/ws/2005/05/addressing/none}ActionNotSupported
- Unable to find required classes (javax.activation.DataHandler and javax.mail.internet.MimeMultipar ...
Android：singleTask + onActivityResult
解决2个Activity互相跳转,并且栈中只保留每个Activity一个对象的存在. 在2个Activity中分别都要用到onActivityResult,所以就不能用launchMode=" ...
js重置form表单
CreateTime--2017年7月19日10:37:11Author:Marydon js重置form表单需要使用的方法:reset() 示例: HTML部分 <form id=&qu ...
排序(2)---------简单插入排序（C语言实现）
插入排序(Insertion Sort)的算法描写叙述是一种简单直观的排序算法. 它的工作原理是通过构建有序序列,对于未排序数据,在已排序序列中从后向前扫描,找到对应位置并插入.插入排序在实现上,通常 ...
【VBA编程】06.控制语句
[IF...THEN...语句] If condition Then [statements1] else [statements2] end if condition 为一个逻辑表达式,表示做选择时 ...
升级macOS Sierra系统导致错误 app: resource fork, Finder information, or similar detritus not allowed
前几天刚升级了macOS Sierra系统,顿时感觉入坑了,本来好好的项目报如下错误: app: resource fork, Finder information, or similar detri ...
js 时间毫秒
1. "2014-08-18 00:00:00" 与 13位毫秒互换 var oTime = { _format_13_time:function (str){ var tim ...
批量Linux、Windows管理工具BatchShell 1.2（最新版）
简介: BatchShell是什么: BatchShell是一款基于SSH2的批量文件传输及命令执行工具,它可以同时传输文件到多台远程服务器以及同时对多台远程服务器执行命令.具备以下主要功能: ...

Scrapy shell使用

Scrapy shell使用的更多相关文章

随机推荐

热门专题