After installing Scrapy, create a new project:

scrapy startproject tutorial

This will create a tutorial directory with the following files:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These are basically:

  • scrapy.cfg: the project configuration file
  • tutorial/: the project’s python module, you’ll later import your code from here.
  • tutorial/items.py: the project’s items file.
  • tutorial/pipelines.py: the project’s pipelines file.
  • tutorial/settings.py: the project’s settings file.
  • tutorial/spiders/: a directory where you’ll later put your spiders.

Defining our Item

Items are containers that will be loaded with the scraped data; they work like simple python dicts but provide additional protection against populating undeclared fields, to prevent typos.

They are declared by creating a scrapy.item.Item class and defining its attributes as scrapy.item.Field objects, like you would in an ORM (don’t worry if you’re not familiar with ORMs, you will see that this is an easy task).

We begin by modeling the item that we will use to hold the site data obtained from dmoz.org. Since we want to capture the name, URL and description of the sites, we define fields for each of these three attributes. To do that, we edit items.py, found in the tutorial directory. Our Item class looks like this:


from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem complicated at first, but defining the item allows you to use other handy components of Scrapy that need to know what your item looks like.

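To see that protection in action, here is a minimal sketch you could run from a Python shell inside the project (it assumes the DmozItem defined above is importable as tutorial.items; the misspelled 'titel' field is deliberate):

from tutorial.items import DmozItem

item = DmozItem(title='Example title')   # declared fields can be set at construction...
item['link'] = 'http://example.com/'     # ...or with dict-style assignment

# setting a field that was never declared raises KeyError, which is what catches typos
try:
    item['titel'] = 'oops'
except KeyError as e:
    print 'rejected:', e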

Our first Spider

Spiders are user-written classes used to scrape information from a domain (or group of domains).

They define an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

To create a Spider, you must subclass scrapy.spider.BaseSpider, and define the three main, mandatory, attributes:

  • name: identifies the Spider. It must be unique, that is, you can’t set the same name for different Spiders.

  • start_urls: is a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.

  • parse() is a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument. This method is responsible for parsing the response data and extracting scraped data (as scraped items) and more URLs to follow.

    The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).

This is the code for our first Spider; save it in a file named dmoz_spider.py under the tutorial/spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"                    # the spider's identifier; it must be unique
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Crawling

To put our spider to work, go to the project’s top level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider for the dmoz.org domain. You will get an output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Pay attention to the lines containing [dmoz], which corresponds to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is shown at the end of the log line, where it says (referer: <None>).

But more interesting, as our parse method instructs, two files have been created in the project’s top-level directory: Books and Resources, with the contents of both URLs.

What just happened under the hood?

Scrapy creates scrapy.http.Request objects for each URL in the start_urls attribute of the Spider, and assigns them the parse method of the spider as their callback function.

These Requests are scheduled, then executed, and scrapy.http.Response objects are returned and then fed back to the spider, through the parse() method.

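Your own callbacks can take part in the same cycle: anything returned as a Request gets scheduled, downloaded, and handed to whichever callback you name. The spider below is a rough, hypothetical sketch (the FollowLinkSpider name and the parse_other callback are invented for illustration, not part of the tutorial):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class FollowLinkSpider(BaseSpider):
    name = "follow_link_example"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        # returning Request objects asks Scrapy to download more pages;
        # each response is then passed to the callback named here
        return [Request("http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
                        callback=self.parse_other)]

    def parse_other(self, response):
        print "fetched", response.url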

Extracting Items

Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses a mechanism based on XPath expressions called XPath selectors. For more information about selectors and other extraction mechanisms see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

  • /html/head/title: selects the <title> element, inside the <head> element of an HTML document
  • /html/head/title/text(): selects the text inside the aforementioned <title> element.
  • //td: selects all the <td> elements
  • //div[@class="mine"]: selects all div elements which contain an attribute class="mine"

These are just a couple of simple examples of what you can do with XPath, but XPath expressions are indeed much more powerful. To learn more about XPath we recommend this XPath tutorial.

For working with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours, HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). In order to use them you must instantiate the desired class with a Response object.

You can see selectors as objects that represent nodes in the document structure. So, the first instantiated selectors are associated to the root node, or the entire document.

Selectors have three methods (see the API documentation for the full details); a short usage sketch follows the list.

  • select(): returns a list of selectors, each of them representing the nodes selected by the xpath expression given as argument.

  • extract(): returns a unicode string with the data selected by the XPath selector.

  • re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
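
As a quick, stand-alone illustration of those three methods, here is a minimal sketch; the HTML snippet and URL are invented, and a fake HtmlResponse is built instead of crawling anything:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

body = """<html>
  <head><title>Python Books: 2</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
  </body>
</html>"""

response = HtmlResponse(url="http://example.com/", body=body, encoding='utf-8')
hxs = HtmlXPathSelector(response)

print hxs.select('//div')                                  # select(): a list of selectors, one per <div>
print hxs.select('//div[@class="mine"]/text()').extract()  # extract(): [u'first']
print hxs.select('//title/text()').re('(\w+):')            # re(): [u'Books']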

Trying Selectors in the Shell

To illustrate the use of Selectors we’re going to use the built-in Scrapy shell, which also requires IPython (an extended Python console) installed on your system.

To start a shell, you must go to the project’s top level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

Output:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s] hxs <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] item Item()
[s] request <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] response <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s] spider <BaseSpider 'default' at 0x1b6c2d0>
[s] xxs <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s] shelp() Print this help
[s] fetch(req_or_url) Fetch a new request or URL and update shell objects
[s] view(response) View response in a browser

In [1]:

After the shell loads, you will have the response fetched in a local response variable, so if you type response.body you will see the body of the response, or you can type response.headers to see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable) with this response. So let’s try them:

This is where XPath selectors come in: with HtmlXPathSelector you can pick out exactly the data you need. Scrapy provides a very convenient way to test XPath expressions (Django has something similar); just type the following in a terminal:

scrapy shell url    # url is the address of the page you want to scrape

and you are dropped into an interactive session where you can experiment with the selector objects, for example:

hxs = HtmlXPathSelector(response)
title = hxs.select('//title/text()').extract()[0].strip().replace(' ', '_')
sites = hxs.select('//ul/li/div/a/img/@src').extract()

In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Extracting the data

Now, let’s try to extract some real information from those pages.

You could type response.body in the console, and inspect the source code to figure out the XPaths you need to use. However, inspecting the raw HTML code there could become a very tedious task. To make this an easier task, you can use some Firefox extensions like Firebug. For more information see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you’ll find that the web sites information is inside a <ul> element, in fact the second <ul> element.

So we can select each <li> element belonging to the sites list with this code:

hxs.select('//ul/li')

And from them, the sites descriptions:

hxs.select('//ul/li/text()').extract()

The sites titles:

hxs.select('//ul/li/a/text()').extract()

And the sites links:

hxs.select('//ul/li/a/@href').extract()

As we said before, each select() call returns a list of selectors, so we can concatenate further select() calls to dig deeper into a node. We are going to use that property here, so:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc

Let’s add this code to our spider:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

Now try crawling the dmoz.org domain again and you’ll see sites being printed in your output, run:

scrapy crawl dmoz

Using our item

Item objects are custom python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax like:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
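
Since items behave like dicts, the other standard dict operations work too. A quick sketch continuing the session above (the output shown is approximate):

>>> item.keys()
['title']
>>> dict(item)
{'title': 'Example title'}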

Spiders are expected to return their scraped data inside Item objects. So, in order to return the data we’ve scraped so far, the final code for our Spider would be like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
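
One small variation worth knowing: instead of accumulating a list, parse() can be written as a generator and yield each item as it is scraped; Scrapy accepts either style. A sketch of the same method in that form (replacing parse() in the spider above):

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            yield item   # handed straight to Scrapy, no intermediate list needed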

I ran into an error with the code above. My project is organized like this:

tutorial
--tutorial
----spiders
------__init__
------dmoz_spider
----__init__
----items
----pipelines
----settings

I started with

from tutorial.items import DmozItem

and PyDev reported an error: tutorial has no module items. Changing it to from tutorial.tutorial.items import DmozItem satisfied PyDev, but then running scrapy crawl dmoz from the command line failed, complaining that tutorial has no module tutorial.items. Changing it back to

from tutorial.items import DmozItem

works.

Now doing a crawl on the dmoz.org domain yields DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}

Storing the scraped data

The simplest way to store the scraped data is by using the Feed exports, with the following command:

scrapy crawl dmoz -o items.json -t json

That will generate an items.json file containing all scraped items, serialized in JSON.

In small projects (like the one in this tutorial), that should be enough. However, if you want to perform more complex things with the scraped items, you can write an Item Pipeline. As with Items, a placeholder file for Item Pipelines has been set up for you when the project is created, in tutorial/pipelines.py. Though you don’t need to implement any item pipeline if you just want to store the scraped items.
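
For a feel of what such a pipeline looks like, here is a hypothetical sketch; the RequireDescriptionPipeline name and its drop-empty-description rule are invented for illustration. It would live in tutorial/pipelines.py and, in Scrapy 0.24, be enabled through the ITEM_PIPELINES setting:

from scrapy.exceptions import DropItem

class RequireDescriptionPipeline(object):
    """Hypothetical pipeline: discard items whose 'desc' field came back empty."""

    def process_item(self, item, spider):
        if not item.get('desc'):
            raise DropItem("missing description: %s" % item.get('link'))
        return item   # returning the item lets it continue through the pipeline

# in tutorial/settings.py (Scrapy 0.24 dict syntax):
# ITEM_PIPELINES = {'tutorial.pipelines.RequireDescriptionPipeline': 300}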

Next steps

This tutorial covers only the basics of Scrapy, but there’s a lot of other features not mentioned here. Check the What else? section in the Scrapy at a glance chapter for a quick overview of the most important ones.

Then, we recommend you continue by playing with an example project (see Examples), and then continue with the section Basic concepts.

A note on versions: Scrapy 0.24 changed the way items are defined. It used to be:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    link = Field()
    desc = Field()

Now it is:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

Similar changes were made elsewhere.
