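Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. Typical uses include data mining, monitoring, and automated testing.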

Scrapy uses the Twisted asynchronous networking library to handle network communication.

Its architecture is made up of the following main components:

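  • Engine (Scrapy Engine)
    Controls the data flow across the whole system and triggers events (the core of the framework).
  • Scheduler
    Accepts the requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs to crawl: it decides which URL is fetched next and drops duplicate URLs.
  • Downloader
    Downloads page content and returns it to the spiders (the downloader is built on Twisted's efficient asynchronous model).
  • Spiders
    Spiders do the main work: they extract the information you need from specific pages, the so-called items (Item). They can also extract links so that Scrapy goes on to crawl the next page.
  • Item Pipeline
    Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page has been parsed by a spider, its items are sent to the item pipeline and pass through several components in a defined order.
  • Downloader Middlewares
    A layer of hooks between the engine and the downloader that processes the requests and responses passing between them.
  • Spider Middlewares
    A layer of hooks between the engine and the spiders that processes the spiders' response input and request output.
  • Scheduler Middlewares
    A layer between the engine and the scheduler that processes the requests and responses passing between them.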

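A Scrapy run proceeds roughly as follows:

    1. The engine takes a URL from the scheduler for the next fetch.
    2. The engine wraps the URL in a Request and passes it to the downloader.
    3. The downloader fetches the resource and wraps it in a Response.
    4. The spider parses the Response.
    5. If an item (Item) is parsed out, it is handed to the item pipeline for further processing.
    6. If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling.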
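
The six steps above can be sketched without Scrapy at all. The toy loop below is only an illustration under stated assumptions: FAKE_WEB, Scheduler, download, parse and run_engine are hypothetical names invented here, the "web" is an in-memory dict, and nothing is fetched over the network. It shows how URLs flow from the scheduler to the downloader and the spider, and how items and new URLs are routed afterwards.

```python
import heapq
from itertools import count

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
}


class Scheduler:
    """Priority queue of URLs that silently drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker so equal priorities stay FIFO

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Stand-in for the downloader: returns a response-like dict."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in for a spider callback: yields one item, then follow-up URLs."""
    yield {"url": response["url"], "title": response["title"]}
    for link in response["links"]:
        yield link


def run_engine(start_url):
    scheduler, collected = Scheduler(), []
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_url()          # step 1
        if url is None:
            break
        response = download(url)            # steps 2 and 3
        for result in parse(response):      # step 4
            if isinstance(result, dict):
                collected.append(result)    # step 5: item goes to the pipeline
            else:
                scheduler.enqueue(result)   # step 6: URL goes back to the scheduler
    return collected


items = run_engine("http://example.com/")
print([item["title"] for item in items])
```

Each page is visited once even though the pages link to each other, because the scheduler remembers every URL it has already accepted.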

Linux

pip3 install scrapy

Windows

# Scrapy depends on pywin32, pyOpenSSL, Twisted, lxml, and zope.interface. (Watch the error messages during installation.)

# Install wheel
pip3 install wheel -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# Install this dependency first; Twisted cannot be installed without it
pip3 install Incremental -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# Then install Twisted with pip3. This may still fail with an error, but it resolves the other dependencies.
pip3 install Twisted -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# Then change to the directory where the downloaded wheel file is stored and install from it; this should succeed.
pip3 install Twisted-17.1.0-cp35-cp35m-win32.whl

# Install Scrapy
pip3 install scrapy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

# pywin32
Download: https://sourceforge.net/projects/pywin32/files/

Check that pywin32 was installed successfully:

C:\Users\Administrator>python
Python 3.5.2 (v3.5.2:4def2a2901a5, ...) [MSC ... (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import win32api
>>> import win32con
>>> win32api.MessageBox(win32con.NULL, 'Python 你好!', '你好', win32con.MB_OK)

II. Basic usage

1. Basic commands

# Create a project
scrapy startproject xiaohuar
# Enter the project directory
cd xiaohuar
# Create a spider
scrapy genspider xiaohuar xiaohuar.com
# Run the spider
scrapy crawl xiaohuar --nolog

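2. Project structure and spider overview

File descriptions:
scrapy.cfg    The project's main configuration. (The settings that actually affect crawling live in settings.py.)
items.py      Defines the data storage templates used to structure data, similar to Django's Model.
pipelines     Data-processing behavior, typically persisting the structured data.
settings.py   Configuration file: recursion depth, concurrency, download delay, and so on.
spiders       The spiders directory, where you create spider files and write crawl rules.

Note: spider files are usually named after the website's domain.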

import scrapy


class XiaoHuarSpider(scrapy.spiders.Spider):
    name = "xiaohuar"  # spider name (important: "scrapy crawl" uses it)
    allowed_domains = ["xiaohuar.com"]  # allowed domains
    start_urls = [
        "http://www.xiaohuar.com/hua/",  # start URL
    ]

    def parse(self, response):
        # Callback invoked with the result of fetching each start URL
        pass

3. Finding tags

# -*- coding: utf-8 -*-
import hashlib
import io
import sys

import scrapy
from scrapy.http import Request
from scrapy.selector import Selector

from ..items import ChoutiItem

# Re-wrap stdout so that Chinese text prints correctly on a GBK Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')


class ChoutiSpider(scrapy.Spider):
    name = "chouti"
    allowed_domains = ["chouti.com"]
    start_urls = ['http://dig.chouti.com/']
    visited_urls = set()

    # Override start_requests to choose the callback that handles the first requests
    # def start_requests(self):
    #     for url in self.start_urls:
    #         yield Request(url, callback=self.parse)

    def parse(self, response):
        # content = str(response.body, encoding='utf-8')

        # Find every <a> tag in the document
        # hxs = Selector(response=response).xpath('//a')  # list of selector objects
        # for i in hxs:
        #     print(i)  # a selector object

        # Convert the selector objects to strings
        # hxs = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]').extract()
        # hxs = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')
        # for obj in hxs:
        #     a = obj.xpath('.//a[@class="show-content"]/text()').extract_first()
        #     print(a.strip())

        # Selector syntax:
        r"""
        //                   anywhere among the descendants
        .//                  among the descendants of the current object
        /                    children
        /div                 div tags among the children
        /div[@id="i1"]       div tags among the children with id=i1
        obj.extract()        convert every object in the list to a string => list
        obj.extract_first()  same conversion, but return only the first element
        //div/text()         the text of a tag
        a/@href              the value of an attribute
        //a[starts-with(@href, "/all/hot/recent/")]/@href    href starts with the given prefix
        //a[re:test(@href, "/all/hot/recent/\d+")]           href matches a regular expression
        """

        # Get all the page links on the current page
        # hxs = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/text()')
        # hxs = Selector(response=response).xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        # hxs = Selector(response=response).xpath('//a[starts-with(@href, "/all/hot/recent/")]/@href').extract()

        hxs1 = Selector(response=response).xpath('//div[@id="content-list"]/div[@class="item"]')  # list of selector objects
        for obj in hxs1:
            title = obj.xpath('.//a[@class="show-content"]/text()').extract_first().strip()
            href = obj.xpath('.//a[@class="show-content"]/@href').extract_first().strip()
            item_obj = ChoutiItem(title=title, href=href)
            # Yielding the item hands it to the pipeline
            yield item_obj

        hxs2 = Selector(response=response).xpath(r'//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for url in hxs2:
            md5_url = self.md5(url)
            if md5_url in self.visited_urls:
                continue
            self.visited_urls.add(md5_url)
            url = "http://dig.chouti.com%s" % url
            # Yielding a Request adds the new URL to the scheduler
            yield Request(url=url, callback=self.parse)

    # def show(self, response):
    #     print(response.text)

    def md5(self, url):
        obj = hashlib.md5()
        obj.update(bytes(url, encoding='utf-8'))
        return obj.hexdigest()
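
The spider above avoids revisiting pages by hashing each URL and keeping the digests in a set. That idea can be tried in isolation; the following is a minimal sketch, where url_fingerprint and filter_new are hypothetical helper names introduced only for this illustration and are not part of Scrapy.

```python
import hashlib


def url_fingerprint(url):
    """Return the hex MD5 digest of a URL, used as a fixed-length set key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def filter_new(urls, visited):
    """Return the URLs not seen before and record their fingerprints."""
    fresh = []
    for url in urls:
        key = url_fingerprint(url)
        if key in visited:
            continue
        visited.add(key)
        fresh.append(url)
    return fresh


visited = set()
first_batch = filter_new(["/all/hot/recent/2", "/all/hot/recent/3", "/all/hot/recent/2"], visited)
second_batch = filter_new(["/all/hot/recent/3", "/all/hot/recent/4"], visited)
print(first_batch, second_batch)
```

Storing digests instead of raw URLs keeps every key the same length, which matters mostly when URLs are long or the set is persisted.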

3. A first try

import hashlib

import scrapy
# Note: HtmlXPathSelector is deprecated and has been removed from recent Scrapy
# releases; there, Selector(response=response).xpath(...) does the same job.
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class DigSpider(scrapy.Spider):
    # Name of the spider; the crawl command starts it by this name
    name = "dig"
    # Allowed domains
    allowed_domains = ["chouti.com"]
    # Start URLs
    start_urls = [
        'http://dig.chouti.com/',
    ]
    has_request_set = {}

    def parse(self, response):
        print(response.url)

        hxs = HtmlXPathSelector(response)
        page_list = hxs.select(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            key = self.md5(page_url)
            if key in self.has_request_set:
                continue
            self.has_request_set[key] = page_url
            yield Request(url=page_url, method='GET', callback=self.parse)

    @staticmethod
    def md5(val):
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        return ha.hexdigest()

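To run this spider, go to the project directory in a terminal and run:

scrapy crawl dig --nolog

The key points of the code above are:

  • Request is the class that wraps a user request; yielding a Request object from a callback tells Scrapy to go on and fetch that URL.
  • HtmlXPathSelector parses the HTML into a structure and provides the selector functionality.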

4. Selectors

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
<head lang="en">
<meta charset="UTF-8">
<title></title>
</head>
<body>
<ul>
<li class="item-"><a id='i1' href="link.html">first item</a></li>
<li class="item-0"><a id='i2' href="llink.html">first item</a></li>
<li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
</ul>
<div><a href="llink2.html">second item</a></div>
</body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')
# hxs = HtmlXPathSelector(response)  # older API, removed from recent Scrapy releases
# print(hxs)
# hxs = Selector(response=response).xpath('//a')  # every <a> tag
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')  # every <a> that is the 2nd <a> child of its parent
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')  # <a> tags that have an id attribute
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')  # <a> tags whose id equals i1
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')  # both conditions must hold
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')  # href contains "link"
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')  # href starts with "link"
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')  # regular expression
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or, equivalently
#     # v = item.xpath('a/span')
#     # not equivalent: '*/a/span' looks one level deeper (a/span under any child element)
#     # v = item.xpath('*/a/span')
#     print(v)
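
The difference between descendant search (.//), child steps (./) and attribute predicates can be tried without installing Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in; this is an assumption made for illustration only. ElementTree supports just a small XPath subset (no re:test, contains, or starts-with) and is not what Scrapy's Selector uses internally. The DOC string is a simplified, well-formed variant of the HTML above.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed variant of the sample document (hypothetical data).
DOC = """<html><body>
<ul>
  <li class="item-0"><a id="i1" href="link.html">first item</a></li>
  <li class="item-1"><a id="i2" href="llink.html">second item<span>vv</span></a></li>
</ul>
<div><a href="llink2.html">third item</a></div>
</body></html>"""

root = ET.fromstring(DOC)  # root is the <html> element

# .//a : every <a> among the descendants of the current element
all_links = [a.get("href") for a in root.findall(".//a")]

# [@id] : only elements that carry an id attribute
with_id = [a.get("id") for a in root.findall(".//a[@id]")]

# [@id='i1'] : attribute equals a value
by_id = root.find(".//a[@id='i1']").text

# ./a : direct children only, relative to the current element
first_li = root.find("./body/ul/li")
child_links = [a.get("href") for a in first_li.findall("./a")]

# Relative paths evaluated per list item, as in the loop over ul_list above
spans = [s.text for li in root.findall("./body/ul/li") for s in li.findall("./a/span")]

print(all_links, with_id, by_id, child_links, spans)
```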

Example: log in to Chouti automatically and upvote posts

# -*- coding: utf-8 -*-
import hashlib

import scrapy
# Note: HtmlXPathSelector is deprecated and has been removed from recent Scrapy
# releases; there, Selector(response=response).xpath(...) does the same job.
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class ChouTiSpider(scrapy.Spider):
    # Name of the spider; the crawl command starts it by this name
    name = "chouti"
    # Allowed domains
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}

    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        # Collect the cookies set by the first response
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value

        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615131255089&password=pppppppp&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        # Re-request the home page with the logged-in cookies
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response)
        hxs = HtmlXPathSelector(response)
        news_list = hxs.select('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            # Upvote each post
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        page_list = hxs.select(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            hash = hashlib.md5()
            hash.update(bytes(page_url, encoding='utf-8'))
            key = hash.hexdigest()
            if key in self.has_request_set:
                continue
            self.has_request_set[key] = page_url
            yield Request(
                url=page_url,
                method='GET',
                callback=self.show
            )

    def do_favor(self, response):
        print(response.text)

Note: set DEPTH_LIMIT = 1 in settings.py to limit the depth of the "recursion".
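
What a depth limit means can be shown with a small breadth-first walk over a made-up link graph. This is a sketch under stated assumptions: LINKS and crawl are hypothetical names, the start page counts as depth 0, every followed link is one level deeper than the page it was found on, and a limit of 0 is treated as "no limit".

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/p1", "/p2"],
    "/p1": ["/p1/a"],
    "/p2": [],
    "/p1/a": ["/p1/a/x"],
    "/p1/a/x": [],
}


def crawl(start, depth_limit):
    """Visit pages breadth-first, dropping links deeper than depth_limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        for link in LINKS.get(url, []):
            child_depth = depth + 1
            if depth_limit and child_depth > depth_limit:
                continue  # too deep: the request is never scheduled
            if link not in seen:
                seen.add(link)
                queue.append((link, child_depth))
    return visited


print(crawl("/", depth_limit=1))
print(len(crawl("/", depth_limit=0)))
```

With a limit of 1 only the start page and the pages it links to directly are visited; with 0 the whole graph is walked.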

5. Structured processing

The examples above do only simple processing, so everything happens directly in the parse method. When you want to collect and process more data, you can use Scrapy items to structure the data and then hand everything to pipelines for processing.

spiders/xiaohuar.py

import hashlib

import scrapy
# Note: HtmlXPathSelector is deprecated and has been removed from recent Scrapy
# releases; there, Selector(response=response).xpath(...) does the same job.
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest

from ..items import XiaoHuarItem


class XiaoHuarSpider(scrapy.Spider):
    # Name of the spider; the crawl command starts it by this name
    name = "xiaohuar"
    # Allowed domains
    allowed_domains = ["xiaohuar.com"]

    start_urls = [
        "http://www.xiaohuar.com/list-1-1.html",
    ]
    # custom_settings = {
    #     'ITEM_PIPELINES': {
    #         'spider1.pipelines.JsonPipeline':
    #     }
    # }
    has_request_set = {}

    def parse(self, response):
        # Analyze the page:
        # find the content that matches the rules (the photos) and save it;
        # then find the <a> tags and follow them, level by level
        hxs = HtmlXPathSelector(response)

        items = hxs.select('//div[@class="item_list infinite_scroll"]/div')
        for item in items:
            src = item.select('.//div[@class="img"]/a/img/@src').extract_first()
            name = item.select('.//div[@class="img"]/span/text()').extract_first()
            school = item.select('.//div[@class="img"]/div[@class="btns"]/a/text()').extract_first()
            url = "http://www.xiaohuar.com%s" % src
            obj = XiaoHuarItem(name=name, school=school, url=url)
            yield obj

        # extract() turns the matched selectors into plain URL strings
        urls = hxs.select(r'//a[re:test(@href, "http://www.xiaohuar.com/list-1-\d+.html")]/@href').extract()
        for url in urls:
            key = self.md5(url)
            if key in self.has_request_set:
                continue
            self.has_request_set[key] = url
            req = Request(url=url, method='GET', callback=self.parse)
            yield req

    @staticmethod
    def md5(val):
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        return ha.hexdigest()

items

import scrapy


class XiaoHuarItem(scrapy.Item):
    name = scrapy.Field()
    school = scrapy.Field()
    url = scrapy.Field()

pipelines

import json
import os

import requests


class JsonPipeline(object):
    """Append every item to a text file, one JSON document per line."""

    def __init__(self):
        self.file = open('xiaohua.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        v = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(v)
        self.file.write('\n')
        self.file.flush()
        return item


class FilePipeline(object):
    """Download the image each item points to into the imgs directory."""

    def __init__(self):
        if not os.path.exists('imgs'):
            os.makedirs('imgs')

    def process_item(self, item, spider):
        response = requests.get(item['url'], stream=True)
        file_name = '%s_%s.jpg' % (item['name'], item['school'])
        with open(os.path.join('imgs', file_name), mode='wb') as f:
            f.write(response.content)
        return item

settings

ITEM_PIPELINES = {
'spider1.pipelines.JsonPipeline': ,
'spider1.pipelines.FilePipeline': ,
}
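# The integer after each entry determines the run order: an item passes through the pipelines from the lowest number to the highest. By convention these numbers are kept in the 0-1000 range.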

Custom pipeline

from scrapy.exceptions import DropItem


class CustomPipeline(object):
    def __init__(self, v):
        self.value = v

    def process_item(self, item, spider):
        # Process the item and persist it here.

        # Returning the item passes it on to the later pipelines
        return item

        # Raising DropItem discards the item; later pipelines never see it
        # raise DropItem()

    @classmethod
    def from_crawler(cls, crawler):
        """
        Called at start-up to create the pipeline object.
        :param crawler:
        :return:
        """
        val = crawler.settings.getint('MMMM')
        return cls(val)

    def open_spider(self, spider):
        """
        Called when the spider starts running.
        :param spider:
        :return:
        """
        print('')

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        :param spider:
        :return:
        """
        print('')
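
How the priority numbers and DropItem interact can be sketched without Scrapy. Everything below is a hypothetical stand-in written for this illustration: the DropItem class replaces scrapy.exceptions.DropItem, the three pipeline classes and their priorities (100, 200, 300) are invented, and process_item takes only the item to keep the sketch short.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class StripNamePipeline:
    def process_item(self, item):
        item["name"] = item.get("name", "").strip()
        return item


class RequireNamePipeline:
    def process_item(self, item):
        if not item["name"]:
            raise DropItem("missing name")
        return item


class CollectPipeline:
    def __init__(self):
        self.saved = []

    def process_item(self, item):
        self.saved.append(dict(item))
        return item


# Hypothetical priorities; the declaration order deliberately differs from the run order.
PIPELINES = {CollectPipeline: 300, StripNamePipeline: 100, RequireNamePipeline: 200}


def build_chain(config):
    """Instantiate the pipelines sorted from the lowest number to the highest."""
    ordered = sorted(config.items(), key=lambda pair: pair[1])
    return [cls() for cls, _priority in ordered]


def run_item(item, chain):
    """Pass one item through the chain; stop as soon as a pipeline drops it."""
    for pipeline in chain:
        try:
            item = pipeline.process_item(item)
        except DropItem:
            return None
    return item


chain = build_chain(PIPELINES)
kept = run_item({"name": "  alice  "}, chain)
dropped = run_item({"name": "   "}, chain)
print([type(p).__name__ for p in chain], kept, dropped, chain[-1].saved)
```

The blank-name item never reaches CollectPipeline, because RequireNamePipeline runs earlier (200 < 300) and drops it.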

6. Middleware

Spider middleware

class SpiderMiddleware(object):

    def process_spider_input(self, response, spider):
        """
        Called once the download has finished, before the response is handed to parse.
        :param response:
        :param spider:
        :return:
        """
        pass

    def process_spider_output(self, response, result, spider):
        """
        Called with whatever the spider returns after it has processed the response.
        :param response:
        :param result:
        :param spider:
        :return: must return an iterable of Request or Item objects
        """
        return result

    def process_spider_exception(self, response, exception, spider):
        """
        Called when an exception is raised.
        :param response:
        :param exception:
        :param spider:
        :return: None to let the following middlewares handle the exception; or an iterable of Request or Item objects, which is passed on to the scheduler or the pipeline
        """
        return None

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts.
        :param start_requests:
        :param spider:
        :return: an iterable of Request objects
        """
        return start_requests

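Downloader middleware

class DownMiddleware1(object):
    def process_request(self, request, spider):
        """
        Called for every request that needs downloading; the request passes through
        the process_request method of each downloader middleware.
        :param request:
        :param spider:
        :return:
            None: continue with the following middlewares and then download
            Response object: stop calling process_request and start calling process_response
            Request object: stop the middleware chain and send the Request back to the scheduler
            raise IgnoreRequest: stop calling process_request and start calling process_exception
        """
        pass

    def process_response(self, request, response, spider):
        """
        Called with the response on its way back from the downloader.
        :param request:
        :param response:
        :param spider:
        :return:
            Response object: passed on to the process_response of the other middlewares
            Request object: stop the middleware chain; the request is rescheduled for download
            raise IgnoreRequest: Request.errback is called
        """
        print('response1')
        return response

    def process_exception(self, request, exception, spider):
        """
        Called when a download handler or a process_request() (of a downloader middleware) raises an exception.
        :param request:
        :param exception:
        :param spider:
        :return:
            None: let the following middlewares handle the exception
            Response object: stop calling the remaining process_exception methods
            Request object: stop the middleware chain; the request is rescheduled for download
        """
        return None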
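
The return-value rules for process_request and process_response are easier to follow with a trace. The sketch below is a simplified model, not Scrapy's implementation: TraceMiddleware, CacheMiddleware, fetch and download are hypothetical names, requests and responses are plain dicts, and only two of the rules are modelled (None continues the chain, a response short-circuits the download). Responses travel back through the middlewares in reverse order.

```python
calls = []  # records the order in which hooks run


class TraceMiddleware:
    def __init__(self, name):
        self.name = name

    def process_request(self, request):
        calls.append(self.name + ".process_request")
        return None  # None: keep going down the chain

    def process_response(self, request, response):
        calls.append(self.name + ".process_response")
        return response


class CacheMiddleware(TraceMiddleware):
    """Returns a stored response for known URLs, which skips the download."""

    def __init__(self, name, cache):
        super().__init__(name)
        self.cache = cache

    def process_request(self, request):
        calls.append(self.name + ".process_request")
        return self.cache.get(request["url"])


def fetch(request):
    """Stand-in for the real download."""
    calls.append("fetch")
    return {"url": request["url"], "body": "fresh"}


def download(request, middlewares):
    response = None
    for mw in middlewares:
        response = mw.process_request(request)
        if response is not None:
            break  # a middleware answered: stop calling process_request
    if response is None:
        response = fetch(request)
    for mw in reversed(middlewares):
        response = mw.process_response(request, response)
    return response


MIDDLEWARES = [
    TraceMiddleware("first"),
    CacheMiddleware("cache", {"http://example.com/hit": {"url": "http://example.com/hit", "body": "cached"}}),
    TraceMiddleware("last"),
]

miss = download({"url": "http://example.com/miss"}, MIDDLEWARES)
miss_calls = list(calls)
calls.clear()
hit = download({"url": "http://example.com/hit"}, MIDDLEWARES)
hit_calls = list(calls)
print(miss_calls)
print(hit_calls)
```

On a cache hit, "last.process_request" and "fetch" never run, yet every process_response still sees the response.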

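7. Custom commands

  • Create a directory with any name, e.g. commands, at the same level as spiders (it needs an __init__.py file so that it can be imported)
  • Inside it, create a crawlall.py file (the file name becomes the name of the custom command)

crawlall.py

from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Older Scrapy releases exposed the loader as crawler_process.spiders
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

  • Add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py
  • Run the command from the project directory: scrapy crawlall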

8. Custom extensions

A custom extension uses signals to register the operations you specify at the points you specify.

from scrapy import signals


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        # Register the handlers for the signals of interest
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('open')

    def spider_closed(self, spider):
        print('close')
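
The connect-then-send pattern the extension relies on can be modelled in a few lines. This is a toy dispatcher written for illustration only; SignalManager, RecordingExtension, from_signals and the two signal constants are hypothetical names and do not reproduce Scrapy's own signal machinery.

```python
from collections import defaultdict

# Hypothetical signal identifiers.
SPIDER_OPENED = "spider_opened"
SPIDER_CLOSED = "spider_closed"


class SignalManager:
    """Toy dispatcher: remembers receivers per signal and calls them on send."""

    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        self._receivers[signal].append(receiver)

    def send(self, signal, **kwargs):
        for receiver in self._receivers[signal]:
            receiver(**kwargs)


class RecordingExtension:
    def __init__(self, label):
        self.label = label
        self.events = []

    @classmethod
    def from_signals(cls, signals, label):
        # Mirrors the shape of from_crawler: build the object, then register handlers
        ext = cls(label)
        signals.connect(ext.spider_opened, signal=SPIDER_OPENED)
        signals.connect(ext.spider_closed, signal=SPIDER_CLOSED)
        return ext

    def spider_opened(self, spider):
        self.events.append("%s: open %s" % (self.label, spider))

    def spider_closed(self, spider):
        self.events.append("%s: close %s" % (self.label, spider))


signals = SignalManager()
extension = RecordingExtension.from_signals(signals, label="ext")
signals.send(SPIDER_OPENED, spider="dig")
signals.send(SPIDER_CLOSED, spider="dig")
print(extension.events)
```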

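9. Avoiding duplicate requests

By default Scrapy deduplicates requests with scrapy.dupefilters.RFPDupeFilter (the module was called scrapy.dupefilter in releases before 1.0). The related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory where the record of visited requests is saved, e.g. /root/"  # the final path is /root/requests.seen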

Custom URL deduplication

class RepeatUrl:
    def __init__(self):
        self.visited_url = set()

    @classmethod
    def from_settings(cls, settings):
        """
        Called at initialization.
        :param settings:
        :return:
        """
        return cls()

    def request_seen(self, request):
        """
        Check whether the current request has already been visited.
        :param request:
        :return: True if it has been visited; False if it has not
        """
        if request.url in self.visited_url:
            return True
        self.visited_url.add(request.url)
        return False

    def open(self):
        """
        Called when crawling starts.
        :return:
        """
        print('open replication')

    def close(self, reason):
        """
        Called when crawling ends.
        :param reason:
        :return:
        """
        print('close replication')

    def log(self, request, spider):
        """
        Log a filtered (duplicate) request.
        :param request:
        :param spider:
        :return:
        """
        print('repeat', request.url)
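
Comparing raw URL strings, as RepeatUrl does, treats ?a=1&b=2 and ?b=2&a=1 as different pages. One common refinement is to normalize the URL and hash it together with the HTTP method. The sketch below illustrates that idea only; canonical_url, fingerprint and FingerprintFilter are hypothetical names, and the normalization rules (sorted query, lower-cased host, fragment dropped) are assumptions chosen for this example rather than a description of what RFPDupeFilter does.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url):
    """Sort the query parameters, lower-case the host, and drop the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))


def fingerprint(method, url):
    """Hash the method together with the normalized URL."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(b" ")
    digest.update(canonical_url(url).encode("utf-8"))
    return digest.hexdigest()


class FingerprintFilter:
    def __init__(self):
        self.seen = set()

    def request_seen(self, method, url):
        """Return True for a repeat; otherwise remember the request and return False."""
        key = fingerprint(method, url)
        if key in self.seen:
            return True
        self.seen.add(key)
        return False


dupefilter = FingerprintFilter()
print(dupefilter.request_seen("GET", "http://example.com/list?a=1&b=2"))
print(dupefilter.request_seen("GET", "http://Example.com/list?b=2&a=1#top"))
print(dupefilter.request_seen("POST", "http://example.com/list?a=1&b=2"))
```

The second call is reported as a repeat even though the strings differ; the POST is treated as a new request because the method is part of the fingerprint.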

10. Miscellaneous

Explanation of some of the settings:

# -*- coding: utf-8 -*-

# Scrapy settings for step8_king project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

# Bot name
BOT_NAME = 'step8_king'

# Spider module paths
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# User-Agent request header sent by the client
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Whether to honor the site's robots.txt
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: )
# Number of concurrent requests
# CONCURRENT_REQUESTS =

# Configure a delay for requests for the same website (default: )
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay in seconds
# DOWNLOAD_DELAY =

# The download delay setting will honor only one of:
# Concurrent requests per domain; the download delay is then applied per domain as well
# CONCURRENT_REQUESTS_PER_DOMAIN =
# Concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored and the download delay is applied per IP
# CONCURRENT_REQUESTS_PER_IP =

# Disable cookies (enabled by default)
# Whether cookies are supported; cookies are handled through a cookiejar
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# Telnet lets you inspect and control the running crawler
# Connect with "telnet ip port", then issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [,]

# Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# Pipelines that process the scraped items
# ITEM_PIPELINES = {
#     'step8_king.pipelines.JsonPipeline': ,
#     'step8_king.pipelines.FilePipeline': ,
# }

# Custom extensions, invoked through signals
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': ,
# }

# Maximum depth the crawler may reach; the current depth is available through meta; 0 means no limit
# DEPTH_LIMIT =

# Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO

# Last in, first out: depth-first
# DEPTH_PRIORITY =
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
# First in, first out: breadth-first
# DEPTH_PRIORITY =
# SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Scheduler
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler

# URL deduplication
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

"""
Automatic throttling algorithm
    from scrapy.extensions.throttle import AutoThrottle
    How the delay is set:
    1. Read the minimum delay: DOWNLOAD_DELAY
    2. Read the maximum delay: AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay: AUTOTHROTTLE_START_DELAY
    4. When a download finishes, take its latency, i.e. the time between opening the connection and receiving the response headers
    5. Compute the new delay using AUTOTHROTTLE_TARGET_CONCURRENCY:
        target_delay = latency / self.target_concurrency
        new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
        new_delay = max(target_delay, new_delay)
        new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
        slot.delay = new_delay
"""

# Enable AutoThrottle
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY =
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY =
# The average number of requests Scrapy should be sending in parallel to each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = True

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

"""
HTTP cache
    Stores requests that have already been sent, together with their responses, so that they can be reused later

    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Whether the HTTP cache is enabled
# HTTPCACHE_ENABLED = True

# Cache policy: cache every request; later requests for the same page are served straight from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: cache according to HTTP response headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

# Cache expiration time in seconds
# HTTPCACHE_EXPIRATION_SECS =

# Directory where the cache is stored
# HTTPCACHE_DIR = 'httpcache'

# HTTP status codes that are never cached
# HTTPCACHE_IGNORE_HTTP_CODES = []

# Cache storage backend
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

"""
Proxies (the default middleware reads them from environment variables)
    from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

    Option 1: use the default middleware and set the variables in
        os.environ
        {
            http_proxy:http://root:woshiniba@192.168.11.11:9999/
            https_proxy:http://192.168.11.11:9999/
        }

    Option 2: use a custom downloader middleware

    import base64
    import random

    def to_bytes(text, encoding=None, errors='strict'):
        if isinstance(text, bytes):
            return text
        if not isinstance(text, str):
            raise TypeError('to_bytes must receive a str or bytes '
                            'object, got %s' % type(text).__name__)
        if encoding is None:
            encoding = 'utf-8'
        return text.encode(encoding, errors)

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            PROXIES = [
                {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
            ]
            proxy = random.choice(PROXIES)
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            if proxy['user_pass']:
                # Send the credentials as a Basic Proxy-Authorization header
                encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                request.headers['Proxy-Authorization'] = b'Basic ' + encoded_user_pass
                print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
            else:
                print("**************ProxyMiddleware no pass************" + proxy['ip_port'])

    DOWNLOADER_MIDDLEWARES = {
        'step8_king.middlewares.ProxyMiddleware': ,
    }
"""

"""
HTTPS access
    There are two cases:
    1. The site being crawled uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

    2. The site being crawled uses a custom certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,  # PKey object
                    certificate=v2,  # X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )
    Other:
        Related classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        Related settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY
"""

"""
Spider middleware
    (The SpiderMiddleware class is the same as the one shown in section 6.)

    Built-in spider middlewares:
        'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': ,
        'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': ,
        'scrapy.spidermiddlewares.referer.RefererMiddleware': ,
        'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': ,
        'scrapy.spidermiddlewares.depth.DepthMiddleware': ,
"""
# from scrapy.spidermiddlewares.referer import RefererMiddleware
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    # 'step8_king.middlewares.SpiderMiddleware': ,
}

"""
Downloader middleware
    (The DownMiddleware1 class is the same as the one shown in section 6.)

    Default downloader middlewares:
    {
        'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': ,
        'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': ,
        'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': ,
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': ,
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': ,
        'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': ,
        'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': ,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': ,
        'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': ,
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': ,
        'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': ,
        'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': ,
        'scrapy.downloadermiddlewares.stats.DownloaderStats': ,
        'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': ,
    }
"""
# from scrapy.downloadermiddlewares.httpauth import HttpAuthMiddleware
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'step8_king.middlewares.DownMiddleware1': ,
#     'step8_king.middlewares.DownMiddleware2': ,
# }
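
The AutoThrottle update rule quoted in the settings comments is plain arithmetic, so it can be checked in isolation. The function below is a direct transcription of those five lines; next_delay and its default bounds (0.0 and 60.0 seconds) are hypothetical choices made for this sketch, not values taken from the settings file.

```python
def next_delay(current_delay, latency, target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Apply one step of the delay update rule quoted above."""
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay toward the target
    new_delay = (current_delay + target_delay) / 2.0
    # Never go below the target itself, so a slow server is backed off from immediately
    new_delay = max(target_delay, new_delay)
    # Clamp to the configured bounds
    return min(max(min_delay, new_delay), max_delay)


delay = 5.0
history = []
for latency in [1.0, 1.0, 1.0, 1.0]:
    delay = next_delay(delay, latency)
    history.append(delay)
print(history)
```

With a steady latency of one second the delay halves its distance to the target on every response (3.0, 2.0, 1.5, 1.25), while a sudden rise in latency pushes the delay up to the new target in a single step.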

11. TinyScrapy

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import types
from queue import Queue

from twisted.internet import defer
from twisted.internet import reactor
# Note: getPage is deprecated in newer Twisted releases
from twisted.web.client import getPage


class Request(object):
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback
        self.priority = 0


class HttpResponse(object):
    def __init__(self, content, request):
        self.content = content
        self.request = request


class ChouTiSpider(object):

    def start_requests(self):
        url_list = ['http://www.cnblogs.com/', 'http://www.bing.com']
        for url in url_list:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.request.url)
        # yield Request(url="http://www.baidu.com", callback=self.parse)


# Shared queue of pending requests (plays the role of the scheduler)
Q = Queue()


class CallLaterOnce(object):
    """Schedule a function on the reactor, at most once at a time."""

    def __init__(self, func, *a, **kw):
        self._func = func
        self._a = a
        self._kw = kw
        self._call = None

    def schedule(self, delay=0):
        if self._call is None:
            self._call = reactor.callLater(delay, self)

    def cancel(self):
        if self._call:
            self._call.cancel()

    def __call__(self):
        self._call = None
        return self._func(*self._a, **self._kw)


class Engine(object):
    def __init__(self):
        self.nextcall = None
        self.crawlling = []
        # Concurrency limit. The number was lost from the source text; 5 is an assumed value.
        self.max = 5
        self._closewait = None

    def get_response(self, content, request):
        response = HttpResponse(content, request)
        gen = request.callback(response)
        if isinstance(gen, types.GeneratorType):
            for req in gen:
                req.priority = request.priority + 1
                Q.put(req)

    def rm_crawlling(self, response, d):
        self.crawlling.remove(d)

    def _next_request(self, spider):
        # Nothing queued and nothing in flight: the crawl is finished
        if Q.qsize() == 0 and len(self.crawlling) == 0:
            self._closewait.callback(None)
            return

        if len(self.crawlling) >= self.max:
            return
        while len(self.crawlling) < self.max:
            try:
                req = Q.get(block=False)
            except Exception as e:
                req = None
            if not req:
                return
            d = getPage(req.url.encode('utf-8'))
            self.crawlling.append(d)
            d.addCallback(self.get_response, req)
            d.addCallback(self.rm_crawlling, d)
            d.addCallback(lambda _: self.nextcall.schedule())

    @defer.inlineCallbacks
    def crawl(self):
        spider = ChouTiSpider()
        start_requests = iter(spider.start_requests())
        flag = True
        while flag:
            try:
                req = next(start_requests)
                Q.put(req)
            except StopIteration as e:
                flag = False

        self.nextcall = CallLaterOnce(self._next_request, spider)
        self.nextcall.schedule()

        self._closewait = defer.Deferred()
        yield self._closewait

    @defer.inlineCallbacks
    def pp(self):
        yield self.crawl()


_active = set()
obj = Engine()
d = obj.crawl()
_active.add(d)

li = defer.DeferredList(_active)
li.addBoth(lambda _, *a, **kw: reactor.stop())

reactor.run()
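
The same "keep at most N downloads in flight, stop when the queue is drained" loop can also be written with the standard library's asyncio instead of Twisted. This is a swapped-in technique shown for comparison, not how Scrapy or TinyScrapy work: fake_fetch is a hypothetical stand-in that sleeps instead of touching the network, and crawl, the URLs and max_concurrency=2 are illustrative choices.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for a network download."""
    await asyncio.sleep(0.01)
    return "<html>%s</html>" % url


async def crawl(start_urls, max_concurrency=2):
    queue = asyncio.Queue()
    for url in start_urls:
        queue.put_nowait(url)

    results = {}
    in_flight = 0
    peak = 0

    async def worker():
        nonlocal in_flight, peak
        while True:
            url = await queue.get()
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results[url] = await fake_fetch(url)
            finally:
                in_flight -= 1
                queue.task_done()

    # One worker per allowed concurrent download
    workers = [asyncio.ensure_future(worker()) for _ in range(max_concurrency)]
    await queue.join()  # returns once every queued URL has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results, peak


urls = ["http://example.com/%d" % n for n in range(5)]
results, peak = asyncio.run(crawl(urls, max_concurrency=2))
print(len(results), peak)
```

Here queue.join() plays the role of the _closewait Deferred, and the fixed number of workers plays the role of the self.max check in _next_request.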


For more documentation, see: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
