Customizing the Spider class

    # Spider class: customize the initial requests and their callback
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.myparse)

    # Scrapy calls this method first; it must return a list or a generator of Request objects

    yield Request(url=page_url, callback=self.parse)  # specify the callback that will parse the response
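
For context, a minimal self-contained spider wired this way might look like the sketch below (the spider name, start URL, and the myparse body are placeholders, not from the original project):

    import scrapy
    from scrapy import Request

    class MySpider(scrapy.Spider):
        name = 'myspider'                        # placeholder name
        start_urls = ['http://www.example.com']  # placeholder start URL

        def start_requests(self):
            # called once at startup; must return a list or generator of Requests
            for url in self.start_urls:
                yield Request(url=url, callback=self.myparse)

        def myparse(self, response):
            # custom initial callback used instead of the default parse()
            print(response.status, response.url)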

The response argument of the parse method

    # print(response.request)                # the Request object is attached to the response
    # print(response.url)                    # the requested URL
    # print(response.headers)                # the response headers
    # print(response.headers['Set-Cookie'])  # the raw cookies

    # dir(response) shows, among others:
    # ['_auto_detect_fun', '_body', '_body_declared_encoding', '_body_inferred_encoding',
    #  '_cached_benc', '_cached_selector', '_cached_ubody', '_declared_encoding', '_encoding',
    #  '_get_body', '_get_url', '_headers_encoding', '_set_body', '_set_url', '_url',
    #  'body', 'body_as_unicode', 'copy', 'css', 'encoding', 'flags', 'follow', 'headers',
    #  'meta', 'replace', 'request', 'selector', 'status', 'text', 'url', 'urljoin', 'xpath']

Attributes

Get the HTTP status code with response.status.
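
Putting the attributes above together, a parse callback might use the response roughly like this (the selector expression is a placeholder; Request is assumed to be imported from scrapy):

    def parse(self, response):
        print(response.status, response.url)          # status code and requested URL
        print(response.headers.get('Content-Type'))   # a single response header

        # css()/xpath() work directly on the response; urljoin() resolves relative links
        for href in response.css('a::attr(href)').extract():
            yield Request(url=response.urljoin(href), callback=self.parse)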

CookiesMiddleware

class scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware

This middleware makes it possible to crawl sites that require cookies (for example, sites that use sessions). It keeps track of the cookies sent by web servers and sends them back on subsequent requests, just like a browser does.

The following settings can be used to configure the cookies middleware:

Multiple cookie sessions per spider

New in version 0.15.

Scrapy supports keeping multiple cookie sessions per spider through the cookiejar Request meta key. By default it uses a single cookie jar (session), but you can pass an identifier to use several.

For example:

    for i, url in enumerate(urls):
        yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                             callback=self.parse_page)

Keep in mind that the cookiejar meta key is not "sticky": you need to keep passing it along on subsequent requests. For example:

    def parse_page(self, response):
        # do some processing
        return scrapy.Request("http://www.example.com/otherpage",
                              meta={'cookiejar': response.meta['cookiejar']},
                              callback=self.parse_other_page)

    # The low-level, manual way of handling cookies with scrapy.http.cookies.CookieJar:
    # from scrapy.http.cookies import CookieJar
    # cookiejar = CookieJar()
    # cookiejar.extract_cookies(response, response.request)
    # temp = {}
    # for k, v in cookiejar._cookies.items():
    #     for i, j in v.items():
    #         for m, n in j.items():
    #             temp[m] = n.value
    # self.cookie_dic.update(temp)
    yield Request(url=url,
                  cookies=self.cookie_dic,
                  )

COOKIES_ENABLED

Default: True

Whether to enable the cookies middleware. If disabled, no cookies will be sent to web servers.

COOKIES_DEBUG

Default: False

If enabled, Scrapy will log all cookies sent in requests (the Cookie request header) and all cookies received in responses (the Set-Cookie response header).
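
In settings.py the two options would look like this (the values shown are only an illustration):

    # settings.py
    COOKIES_ENABLED = True   # default; set to False to stop sending cookies to web servers
    COOKIES_DEBUG = True     # log the Cookie / Set-Cookie headers of every request/response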

Here is an example of the log output with COOKIES_DEBUG enabled:

    2011-04-06 14:35:10-0300 [diningcity] INFO: Spider opened
    2011-04-06 14:35:10-0300 [diningcity] DEBUG: Sending cookies to: <GET http://www.diningcity.com/netherlands/index.html>
            Cookie: clientlanguage_nl=en_EN
    2011-04-06 14:35:14-0300 [diningcity] DEBUG: Received cookies from: <200 http://www.diningcity.com/netherlands/index.html>
            Set-Cookie: JSESSIONID=B~FA4DC0C496C8762AE4F1A620EAB34F38; Path=/
            Set-Cookie: ip_isocode=US
            Set-Cookie: clientlanguage_nl=en_EN; Expires=Thu, 07-Apr-2011 21:21:34 GMT; Path=/
    2011-04-06 14:49:50-0300 [diningcity] DEBUG: Crawled (200) <GET http://www.diningcity.com/netherlands/index.html> (referer: None)
    [...]

pipelines

    Given the configuration:
        ITEM_PIPELINES = {
            'xiangmu.pipelines.FilePipeline': 300,
            'xiangmu.pipelines.DBPipeline': 301,
        }
    after locating a class such as FilePipeline, Scrapy first checks whether the class defines from_crawler.
    If it does:
        obj = FilePipeline.from_crawler(crawler)
    otherwise:
        obj = FilePipeline()

    obj.open_spider(..)
    obj.process_item(..)
    obj.close_spider(..)
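
The instantiation rule above can be summed up in a small sketch (a simplification for illustration, not Scrapy's actual ItemPipelineManager code):

    def build_pipeline(pipeline_cls, crawler):
        # prefer the from_crawler factory so the pipeline can read crawler.settings
        if hasattr(pipeline_cls, 'from_crawler'):
            return pipeline_cls.from_crawler(crawler)
        return pipeline_cls()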
    class FilePipeline(object):
        def __init__(self, path):
            self.path = path
            self.f = None

        @classmethod
        def from_crawler(cls, crawler):
            """
            Called when the pipeline object is created.
            :param crawler:
            :return:
            """
            # return cls()
            path = crawler.settings.get('XL_FILE_PATH')
            return cls(path)

        def process_item(self, item, spider):
            self.f.write(item['href'] + '\n')
            return item
            # To drop the item so that later pipelines never see it:
            # raise DropItem()

        def open_spider(self, spider):
            """
            Called when the spider starts.
            :param spider:
            :return:
            """
            self.f = open(self.path, 'w')

        def close_spider(self, spider):
            """
            Called when the spider is closed.
            :param spider:
            :return:
            """
            self.f.close()

Custom file pipeline
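
For this pipeline to run it has to be registered in settings.py, together with the XL_FILE_PATH setting it reads in from_crawler (the file name below is only an example):

    # settings.py
    ITEM_PIPELINES = {
        'xiangmu.pipelines.FilePipeline': 300,
    }
    XL_FILE_PATH = 'urls.txt'   # example value; any writable path works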

    class DBPipeline(object):

        def __init__(self, conn):
            self.conn = conn

        @classmethod
        def from_crawler(cls, crawler):
            import pymysql
            conn = pymysql.connect(host='127.0.0.1', password='', port=3306,
                                   db='s10', user='root', charset='utf8')
            # pool = crawler.settings.get('MYSQLPOOL')
            return cls(conn)

        def process_item(self, item, spider):
            print(item['href'], item['text'])
            self.cursor.execute('insert into url (url,title) VALUES (%s,%s)',
                                (item['href'], item['text']))
            return item

        def open_spider(self, spider):
            self.cursor = self.conn.cursor()

        def close_spider(self, spider):
            self.conn.commit()
            self.conn.close()

Custom database pipeline
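
The connection parameters are hard-coded above. A common variation, hinted at by the commented-out crawler.settings.get('MYSQLPOOL') line, is to read them from settings instead; a sketch with hypothetical setting names:

    @classmethod
    def from_crawler(cls, crawler):
        import pymysql
        # MYSQL_* names are hypothetical settings, not part of the original project
        conn = pymysql.connect(
            host=crawler.settings.get('MYSQL_HOST', '127.0.0.1'),
            port=crawler.settings.getint('MYSQL_PORT', 3306),
            user=crawler.settings.get('MYSQL_USER', 'root'),
            password=crawler.settings.get('MYSQL_PASSWORD', ''),
            db=crawler.settings.get('MYSQL_DB', 's10'),
            charset='utf8',
        )
        return cls(conn)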

Deduplication: DupeFilter

Scrapy ships with a default request deduplication filter.

To use a custom dedup filter, point DUPEFILTER_CLASS at it in settings.py:

    DUPEFILTER_CLASS = 'pachong.dupefilter.MyDupeFilter'

    from scrapy.dupefilter import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint

    class MyDupeFilter(BaseDupeFilter):

        def __init__(self, path=None, debug=False):
            self.record = set()

        @classmethod
        def from_settings(cls, settings):
            return cls()

        def request_seen(self, request):
            fingerprint = request_fingerprint(request)
            if fingerprint in self.record:
                print('already seen this request')
                return True
            else:
                self.record.add(fingerprint)

        def request_fingerprint(self, request):
            pass

        def close(self, reason):
            pass

Custom dedup filter
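
request_fingerprint hashes a canonical form of the request (method, canonicalized URL, body), so two requests that differ only in query-parameter order map to the same fingerprint. A quick check, using the same import as above:

    from scrapy.http import Request
    from scrapy.utils.request import request_fingerprint

    r1 = Request('http://www.example.com/?a=1&b=2')
    r2 = Request('http://www.example.com/?b=2&a=1')
    print(request_fingerprint(r1) == request_fingerprint(r2))   # True: same canonical URL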

Middleware

    class SpiderMiddleware(object):

        def process_spider_input(self, response, spider):
            """
            Called for each downloaded response before it is handed to parse
            :param response:
            :param spider:
            :return:
            """
            pass

        def process_spider_output(self, response, result, spider):
            """
            Called with the results returned by the spider
            :param response:
            :param result:
            :param spider:
            :return: must return an iterable of Request or Item objects
            """
            return result

        def process_spider_exception(self, response, exception, spider):
            """
            Called when an exception is raised
            :param response:
            :param exception:
            :param spider:
            :return: None to let the following middlewares handle the exception, or an
                     iterable of Response/Item objects, which is handed to the scheduler or pipelines
            """
            return None

        def process_start_requests(self, start_requests, spider):
            """
            Called when the spider starts, with its start requests
            :param start_requests:
            :param spider:
            :return: an iterable of Request objects
            """
            return start_requests

Spider middleware
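
A custom spider middleware only takes effect once it is listed in SPIDER_MIDDLEWARES (the class path below matches the example project used later in the settings reference; the number is the ordering priority):

    # settings.py
    SPIDER_MIDDLEWARES = {
        'step8_king.middlewares.SpiderMiddleware': 543,
    }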

Downloader middleware

    class DownMiddleware1(object):
        def process_request(self, request, spider):
            """
            Called for every request, through all downloader middlewares, before it is downloaded
            :param request:
            :param spider:
            :return:
                None: continue with the following middlewares and download the request
                Response object: stop calling process_request and start calling process_response
                Request object: stop the middleware chain and hand the Request back to the scheduler
                raise IgnoreRequest: stop calling process_request and start calling process_exception
            """
            pass

        def process_response(self, request, response, spider):
            """
            Called with the downloaded response, on its way back to the spider
            :param request:
            :param response:
            :param spider:
            :return:
                Response object: passed on to the process_response of the other middlewares
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            """
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            """
            Called when a download handler or a process_request (of a downloader middleware) raises an exception
            :param request:
            :param exception:
            :param spider:
            :return:
                None: let the following middlewares handle the exception
                Response object: stop calling the remaining process_exception methods
                Request object: stop the middleware chain; the request is rescheduled for download
            """
            return None

Downloader middleware
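
Likewise, a downloader middleware has to be registered in DOWNLOADER_MIDDLEWARES (class paths taken from the settings reference below):

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'step8_king.middlewares.DownMiddleware1': 100,
        'step8_king.middlewares.DownMiddleware2': 500,
    }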

Other

    # -*- coding: utf-8 -*-

    # Scrapy settings for step8_king project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    # http://doc.scrapy.org/en/latest/topics/settings.html
    # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

    # 1. Bot name
    BOT_NAME = 'step8_king'

    # 2. Spider module paths
    SPIDER_MODULES = ['step8_king.spiders']
    NEWSPIDER_MODULE = 'step8_king.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    # 3. Client User-Agent request header
    # USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    # 4. Whether to obey robots.txt
    # ROBOTSTXT_OBEY = False

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    # 5. Number of concurrent requests
    # CONCURRENT_REQUESTS = 4

    # Configure a delay for requests for the same website (default: 0)
    # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    # 6. Download delay in seconds
    # DOWNLOAD_DELAY = 2

    # The download delay setting will honor only one of:
    # 7. Concurrent requests per domain; the download delay is also applied per domain
    # CONCURRENT_REQUESTS_PER_DOMAIN = 2
    # Concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored
    # and the download delay is applied per IP
    # CONCURRENT_REQUESTS_PER_IP = 3

    # Disable cookies (enabled by default)
    # 8. Whether cookies are supported; they are handled through a cookiejar
    # COOKIES_ENABLED = True
    # COOKIES_DEBUG = True

    # Disable Telnet Console (enabled by default)
    # 9. The Telnet console lets you inspect and control the running crawler;
    #    connect with `telnet ip port`, then issue commands
    # TELNETCONSOLE_ENABLED = True
    # TELNETCONSOLE_HOST = '127.0.0.1'
    # TELNETCONSOLE_PORT = [6023,]

    # 10. Default request headers
    # Override the default request headers:
    # DEFAULT_REQUEST_HEADERS = {
    #     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #     'Accept-Language': 'en',
    # }

    # Configure item pipelines
    # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
    # 11. Item pipelines that process the scraped items
    # ITEM_PIPELINES = {
    #     'step8_king.pipelines.JsonPipeline': 700,
    #     'step8_king.pipelines.FilePipeline': 500,
    # }

    # 12. Custom extensions, invoked via signals
    # Enable or disable extensions
    # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
    # EXTENSIONS = {
    #     # 'step8_king.extensions.MyExtension': 500,
    # }

    # 13. Maximum crawl depth; the current depth can be read from meta; 0 means unlimited
    # DEPTH_LIMIT = 3

    # 14. Crawl order: 0 = depth-first, LIFO (default); 1 = breadth-first, FIFO

    # Last in, first out: depth-first
    # DEPTH_PRIORITY = 0
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
    # First in, first out: breadth-first

    # DEPTH_PRIORITY = 1
    # SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
    # SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

    # 15. Scheduler queue
    # SCHEDULER = 'scrapy.core.scheduler.Scheduler'
    # from scrapy.core.scheduler import Scheduler

    # 16. URL deduplication
    # DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See http://doc.scrapy.org/en/latest/topics/autothrottle.html

    """
    17. Auto-throttle algorithm
        from scrapy.contrib.throttle import AutoThrottle
        How the automatic throttling works:
        1. read the minimum delay DOWNLOAD_DELAY
        2. read the maximum delay AUTOTHROTTLE_MAX_DELAY
        3. set the initial download delay AUTOTHROTTLE_START_DELAY
        4. when a request finishes downloading, take its latency, i.e. the time between
           sending the request and receiving the response headers
        5. combine it with AUTOTHROTTLE_TARGET_CONCURRENCY to compute the next delay:
            target_delay = latency / self.target_concurrency
            new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
            new_delay = max(target_delay, new_delay)
            new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
            slot.delay = new_delay
    """

    # Enable auto-throttling
    # AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    # AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    # AUTOTHROTTLE_MAX_DELAY = 10
    # The average number of requests Scrapy should be sending in parallel to each remote server
    # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

    # Enable showing throttling stats for every response received:
    # AUTOTHROTTLE_DEBUG = True

    # Enable and configure HTTP caching (disabled by default)
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

    """
    18. HTTP caching
    Caches requests/responses that have already been fetched so they can be reused later.

    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
    """
    # Whether to enable the HTTP cache
    # HTTPCACHE_ENABLED = True

    # Cache policy: cache every request; later requests are served straight from the cache
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
    # Cache policy: follow HTTP caching headers such as Cache-Control and Last-Modified
    # HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"

    # Cache expiration time in seconds
    # HTTPCACHE_EXPIRATION_SECS = 0

    # Cache directory
    # HTTPCACHE_DIR = 'httpcache'

    # HTTP status codes that should not be cached
    # HTTPCACHE_IGNORE_HTTP_CODES = []

    # Cache storage backend
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

    """
    19. Proxies; for the default middleware they are configured via environment variables
        from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

        Option 1: use the default middleware
            os.environ
            {
                http_proxy:http://root:woshiniba@192.168.11.11:9999/
                https_proxy:http://192.168.11.11:9999/
            }
        Option 2: use a custom downloader middleware

            def to_bytes(text, encoding=None, errors='strict'):
                if isinstance(text, bytes):
                    return text
                if not isinstance(text, six.string_types):
                    raise TypeError('to_bytes must receive a unicode, str or bytes '
                                    'object, got %s' % type(text).__name__)
                if encoding is None:
                    encoding = 'utf-8'
                return text.encode(encoding, errors)

            class ProxyMiddleware(object):
                def process_request(self, request, spider):
                    PROXIES = [
                        {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                        {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                        {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                        {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                        {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                        {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
                    ]
                    proxy = random.choice(PROXIES)
                    if proxy['user_pass'] is not None:
                        request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                        encoded_user_pass = base64.encodestring(to_bytes(proxy['user_pass']))
                        request.headers['Proxy-Authorization'] = to_bytes('Basic ' + encoded_user_pass)
                        print "**************ProxyMiddleware have pass************" + proxy['ip_port']
                    else:
                        print "**************ProxyMiddleware no pass************" + proxy['ip_port']
                        request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])

            DOWNLOADER_MIDDLEWARES = {
                'step8_king.middlewares.ProxyMiddleware': 500,
            }

    """

    """
    20. HTTPS access
    There are two cases when crawling HTTPS sites:
        1. the target site uses a trusted certificate (supported by default)
            DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
            DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"

        2. the target site uses a custom certificate
            DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
            DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

            # https.py
            from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
            from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

            class MySSLFactory(ScrapyClientContextFactory):
                def getCertificateOptions(self):
                    from OpenSSL import crypto
                    v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                    v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                    return CertificateOptions(
                        privateKey=v1,  # a PKey object
                        certificate=v2,  # an X509 object
                        verify=False,
                        method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                    )
    Related:
        classes
            scrapy.core.downloader.handlers.http.HttpDownloadHandler
            scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
            scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
        settings
            DOWNLOADER_HTTPCLIENTFACTORY
            DOWNLOADER_CLIENTCONTEXTFACTORY

    """

    """
    21. Spider middleware
        class SpiderMiddleware(object):

            def process_spider_input(self, response, spider):
                '''
                Called for each downloaded response before it is handed to parse
                :param response:
                :param spider:
                :return:
                '''
                pass

            def process_spider_output(self, response, result, spider):
                '''
                Called with the results returned by the spider
                :param response:
                :param result:
                :param spider:
                :return: must return an iterable of Request or Item objects
                '''
                return result

            def process_spider_exception(self, response, exception, spider):
                '''
                Called when an exception is raised
                :param response:
                :param exception:
                :param spider:
                :return: None to let the following middlewares handle the exception, or an
                         iterable of Response/Item objects, handed to the scheduler or pipelines
                '''
                return None

            def process_start_requests(self, start_requests, spider):
                '''
                Called when the spider starts, with its start requests
                :param start_requests:
                :param spider:
                :return: an iterable of Request objects
                '''
                return start_requests

        Built-in spider middlewares:
            'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
            'scrapy.contrib.spidermiddleware.offsite.OffsiteMiddleware': 500,
            'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': 700,
            'scrapy.contrib.spidermiddleware.urllength.UrlLengthMiddleware': 800,
            'scrapy.contrib.spidermiddleware.depth.DepthMiddleware': 900,

    """
    # from scrapy.contrib.spidermiddleware.referer import RefererMiddleware
    # Enable or disable spider middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
    SPIDER_MIDDLEWARES = {
        # 'step8_king.middlewares.SpiderMiddleware': 543,
    }

    """
    22. Downloader middleware
        class DownMiddleware1(object):
            def process_request(self, request, spider):
                '''
                Called for every request, through all downloader middlewares, before it is downloaded
                :param request:
                :param spider:
                :return:
                    None: continue with the following middlewares and download the request
                    Response object: stop calling process_request and start calling process_response
                    Request object: stop the middleware chain and hand the Request back to the scheduler
                    raise IgnoreRequest: stop calling process_request and start calling process_exception
                '''
                pass

            def process_response(self, request, response, spider):
                '''
                Called with the downloaded response, on its way back to the spider
                :param request:
                :param response:
                :param spider:
                :return:
                    Response object: passed on to the process_response of the other middlewares
                    Request object: stop the middleware chain; the request is rescheduled for download
                    raise IgnoreRequest: Request.errback is called
                '''
                print('response1')
                return response

            def process_exception(self, request, exception, spider):
                '''
                Called when a download handler or a process_request (of a downloader middleware) raises an exception
                :param request:
                :param exception:
                :param spider:
                :return:
                    None: let the following middlewares handle the exception
                    Response object: stop calling the remaining process_exception methods
                    Request object: stop the middleware chain; the request is rescheduled for download
                '''
                return None

        Default downloader middlewares:
        {
            'scrapy.contrib.downloadermiddleware.robotstxt.RobotsTxtMiddleware': 100,
            'scrapy.contrib.downloadermiddleware.httpauth.HttpAuthMiddleware': 300,
            'scrapy.contrib.downloadermiddleware.downloadtimeout.DownloadTimeoutMiddleware': 350,
            'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
            'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 500,
            'scrapy.contrib.downloadermiddleware.defaultheaders.DefaultHeadersMiddleware': 550,
            'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
            'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': 590,
            'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
            'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
            'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 750,
            'scrapy.contrib.downloadermiddleware.chunked.ChunkedTransferMiddleware': 830,
            'scrapy.contrib.downloadermiddleware.stats.DownloaderStats': 850,
            'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
        }

    """
    # from scrapy.contrib.downloadermiddleware.httpauth import HttpAuthMiddleware
    # Enable or disable downloader middlewares
    # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
    # DOWNLOADER_MIDDLEWARES = {
    #     'step8_king.middlewares.DownMiddleware1': 100,
    #     'step8_king.middlewares.DownMiddleware2': 500,
    # }

settings

Other parameters
