Main contents of Lesson 6:

  • Douban text-crawling example: douban
  • Douban image-crawling example: douban_imgs

1. Douban text-crawling example: douban

Directory structure

douban/
    douban/
        spiders/
            __init__.py
            bookspider.py
            douban_comment_spider.py
            doumailspider.py
        __init__.py
        items.py
        pipelines.py
        settings.py
    scrapy.cfg
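
This is the standard skeleton produced by scrapy startproject, with the three spider files added under spiders/. If you want to recreate the skeleton, a runner in the same style as run_spider.py from the second example works (typing `scrapy startproject douban` in a shell is equivalent):

from scrapy import cmdline
# generates douban/ with the douban package, settings.py and scrapy.cfg inside
cmdline.execute('scrapy startproject douban'.split())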

spiders/__init__.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

bookspider.py

# -*- coding:utf-8 -*-
'''by sudo rm -rf  http://imchenkun.com'''
import scrapy
from douban.items import DoubanBookItem


class BookSpider(scrapy.Spider):
    name = 'douban-book'
    allowed_domains = ['douban.com']
    start_urls = [
        'https://book.douban.com/top250'
    ]

    def parse(self, response):
        # request the first page
        yield scrapy.Request(response.url, callback=self.parse_next)

        # request the remaining pages
        for page in response.xpath('//div[@class="paginator"]/a'):
            link = page.xpath('@href').extract()[0]
            yield scrapy.Request(link, callback=self.parse_next)

    def parse_next(self, response):
        for item in response.xpath('//tr[@class="item"]'):
            book = DoubanBookItem()
            book['name'] = item.xpath('td[2]/div[1]/a/@title').extract()[0]
            book['content'] = item.xpath('td[2]/p/text()').extract()[0]
            book['ratings'] = item.xpath('td[2]/div[2]/span[2]/text()').extract()[0]
            yield book
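
To try the spider, run it from the project root; a minimal sketch in the style of run_spider.py from the second example (the -o export option and the books.csv filename are illustrative, not part of the course code):

from scrapy import cmdline
# equivalent to typing `scrapy crawl douban-book -o books.csv` in a shell;
# -o exports the yielded items to a CSV feed
cmdline.execute('scrapy crawl douban-book -o books.csv'.split())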

douban_comment_spider.py

# -*- coding:utf-8 -*-
# In Python 2 keyboard input is read with raw_input and print is a statement;
# in Python 3 the function is input and print needs parentheses. The code
# below targets Python 3; adjust it for your own version.
import scrapy
from faker import Factory
from douban.items import DoubanMovieCommentItem
import urllib.parse

f = Factory.create()


# note: the class name MailSpider was carried over from the doumail example;
# Scrapy identifies the spider by its name attribute, 'douban-comment'
class MailSpider(scrapy.Spider):
    name = 'douban-comment'
    allowed_domains = ['accounts.douban.com', 'douban.com']
    start_urls = [
        'https://www.douban.com/'
    ]

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Connection': 'keep-alive',
        'Host': 'accounts.douban.com',
        'User-Agent': f.user_agent()
    }

    formdata = {
        'form_email': 'your email',        # your Douban account email
        'form_password': 'your password',  # your password
        # 'captcha-solution': '',
        # 'captcha-id': '',
        'login': '登录',
        'redir': 'https://www.douban.com/',
        'source': 'None'
    }

    def start_requests(self):
        return [scrapy.Request(url='https://www.douban.com/accounts/login',
                               headers=self.headers,
                               meta={'cookiejar': 1},
                               callback=self.parse_login)]

    def parse_login(self, response):
        # a captcha, if present, has to be solved by hand;
        # check response.text (str), not response.body (bytes), under Python 3
        if 'captcha_image' in response.text:
            print('Copy the link:')
            link = response.xpath('//img[@class="captcha_image"]/@src').extract()[0]
            print(link)
            captcha_solution = input('captcha-solution:')
            # parse_qs returns a list per key, hence the trailing [0]
            captcha_id = urllib.parse.parse_qs(
                urllib.parse.urlparse(link).query, True)['id'][0]
            self.formdata['captcha-solution'] = captcha_solution
            self.formdata['captcha-id'] = captcha_id
        return [scrapy.FormRequest.from_response(response,
                                                 formdata=self.formdata,
                                                 headers=self.headers,
                                                 meta={'cookiejar': response.meta['cookiejar']},
                                                 callback=self.after_login)]

    def after_login(self, response):
        print(response.status)
        self.headers['Host'] = "www.douban.com"
        yield scrapy.Request(url='https://movie.douban.com/subject/22266320/reviews',
                             meta={'cookiejar': response.meta['cookiejar']},
                             headers=self.headers,
                             callback=self.parse_comment_url)
        yield scrapy.Request(url='https://movie.douban.com/subject/22266320/reviews',
                             meta={'cookiejar': response.meta['cookiejar']},
                             headers=self.headers,
                             callback=self.parse_next_page,
                             dont_filter=True)  # do not deduplicate this request

    def parse_next_page(self, response):
        print(response.status)
        try:
            next_url = response.urljoin(response.xpath('//span[@class="next"]/a/@href').extract()[0])
            print('next page')
            print(next_url)
            yield scrapy.Request(url=next_url,
                                 meta={'cookiejar': response.meta['cookiejar']},
                                 headers=self.headers,
                                 callback=self.parse_comment_url,
                                 dont_filter=True)
            yield scrapy.Request(url=next_url,
                                 meta={'cookiejar': response.meta['cookiejar']},
                                 headers=self.headers,
                                 callback=self.parse_next_page,
                                 dont_filter=True)
        except IndexError:  # no "next" link on the last page
            print('Next page Error')
            return

    def parse_comment_url(self, response):
        print(response.status)
        for item in response.xpath('//div[@class="main review-item"]'):
            comment_url = item.xpath('header/h3[@class="title"]/a/@href').extract()[0]
            comment_title = item.xpath('header/h3[@class="title"]/a/text()').extract()[0]
            print(comment_title)
            print(comment_url)
            yield scrapy.Request(url=comment_url,
                                 meta={'cookiejar': response.meta['cookiejar']},
                                 headers=self.headers,
                                 callback=self.parse_comment)

    def parse_comment(self, response):
        print(response.status)
        for item in response.xpath('//div[@id="content"]'):
            comment = DoubanMovieCommentItem()
            comment['useful_num'] = item.xpath('//div[@class="main-panel-useful"]/button[1]/text()').extract()[0].strip()
            comment['no_help_num'] = item.xpath('//div[@class="main-panel-useful"]/button[2]/text()').extract()[0].strip()
            comment['people'] = item.xpath('//span[@property="v:reviewer"]/text()').extract()[0]
            comment['people_url'] = item.xpath('//header[@class="main-hd"]/a[1]/@href').extract()[0]
            comment['star'] = item.xpath('//header[@class="main-hd"]/span[1]/@title').extract()[0]

            data_type = item.xpath('//div[@id="link-report"]/div/@data-original').extract()[0]
            print('data_type: ' + data_type)
            if data_type == '0':
                comment['comment'] = '\t#####\t'.join(
                    map(lambda x: x.strip(), item.xpath('//div[@id="link-report"]/div/p/text()').extract()))
            elif data_type == '1':
                comment['comment'] = '\t#####\t'.join(
                    map(lambda x: x.strip(), item.xpath('//div[@id="link-report"]/div[1]/text()').extract()))
            comment['title'] = item.xpath('//span[@property="v:summary"]/text()').extract()[0]
            comment['comment_page_url'] = response.url
            # print(comment)
            yield comment
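
The captcha id is carried in the query string of the captcha image URL; here is that extraction as a standalone sketch (the URL below is made up for illustration):

from urllib.parse import parse_qs, urlparse

link = 'https://www.douban.com/misc/captcha?id=abc123:en&size=s'  # hypothetical URL
# parse_qs maps every key to a list of values, hence the trailing [0]
captcha_id = parse_qs(urlparse(link).query, keep_blank_values=True)['id'][0]
print(captcha_id)  # abc123:en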

doumailspider.py

# -*- coding:utf-8 -*-
'''by sudo rm -rf  http://imchenkun.com'''
import scrapy
from faker import Factory
from douban.items import DoubanMailItem
import urllib.parse

f = Factory.create()


class MailSpider(scrapy.Spider):
    name = 'douban-mail'
    allowed_domains = ['accounts.douban.com', 'douban.com']
    start_urls = [
        'https://www.douban.com/'
    ]

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Connection': 'keep-alive',
        'Host': 'accounts.douban.com',
        'User-Agent': f.user_agent()
    }

    formdata = {
        'form_email': 'your email',        # your Douban account email
        'form_password': 'your password',  # your password
        # 'captcha-solution': '',
        # 'captcha-id': '',
        'login': '登录',
        'redir': 'https://www.douban.com/',
        'source': 'None'
    }

    def start_requests(self):
        return [scrapy.Request(url='https://www.douban.com/accounts/login',
                               headers=self.headers,
                               meta={'cookiejar': 1},
                               callback=self.parse_login)]

    def parse_login(self, response):
        # a captcha, if present, has to be solved by hand;
        # check response.text (str), not response.body (bytes), under Python 3
        if 'captcha_image' in response.text:
            print('Copy the link:')
            link = response.xpath('//img[@class="captcha_image"]/@src').extract()[0]
            print(link)
            captcha_solution = input('captcha-solution:')
            # parse_qs returns a list per key, hence the trailing [0]
            captcha_id = urllib.parse.parse_qs(
                urllib.parse.urlparse(link).query, True)['id'][0]
            self.formdata['captcha-solution'] = captcha_solution
            self.formdata['captcha-id'] = captcha_id
        return [scrapy.FormRequest.from_response(response,
                                                 formdata=self.formdata,
                                                 headers=self.headers,
                                                 meta={'cookiejar': response.meta['cookiejar']},
                                                 callback=self.after_login)]

    def after_login(self, response):
        print(response.status)
        self.headers['Host'] = "www.douban.com"
        return scrapy.Request(url='https://www.douban.com/doumail/',
                              meta={'cookiejar': response.meta['cookiejar']},
                              headers=self.headers,
                              callback=self.parse_mail)

    def parse_mail(self, response):
        print(response.status)
        for item in response.xpath('//div[@class="doumail-list"]/ul/li'):
            mail = DoubanMailItem()
            mail['sender_time'] = item.xpath('div[2]/div/span[1]/text()').extract()[0]
            mail['sender_from'] = item.xpath('div[2]/div/span[2]/text()').extract()[0]
            mail['url'] = item.xpath('div[2]/p/a/@href').extract()[0]
            mail['title'] = item.xpath('div[2]/p/a/text()').extract()[0]
            print(mail)
            yield mail
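
Both login spiders keep the session alive by tagging every request with the same meta={'cookiejar': ...} value, which is how Scrapy's cookies middleware groups cookies into named jars. A minimal sketch of the pattern (spider name and URLs are hypothetical):

import scrapy

class SessionSpider(scrapy.Spider):
    name = 'session-demo'  # hypothetical spider, for illustration only

    def start_requests(self):
        # every request tagged with the same cookiejar id shares one cookie jar
        yield scrapy.Request('https://example.com/login',
                             meta={'cookiejar': 1},
                             callback=self.after_login)

    def after_login(self, response):
        # pass the jar id along so the session cookies follow the crawl
        yield scrapy.Request('https://example.com/private',
                             meta={'cookiejar': response.meta['cookiejar']},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('fetched %s with session cookies', response.url)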

__init__.py

(no code)

items.py

# -*- coding: utf-8 -*-
import scrapy


class DoubanBookItem(scrapy.Item):
    name = scrapy.Field()          # book title
    price = scrapy.Field()         # price
    edition_year = scrapy.Field()  # year of publication
    publisher = scrapy.Field()     # publisher
    ratings = scrapy.Field()       # rating
    author = scrapy.Field()        # author
    content = scrapy.Field()


class DoubanMailItem(scrapy.Item):
    sender_time = scrapy.Field()   # time sent
    sender_from = scrapy.Field()   # sender
    url = scrapy.Field()           # URL of the mail detail page
    title = scrapy.Field()         # mail title


class DoubanMovieCommentItem(scrapy.Item):
    useful_num = scrapy.Field()    # how many found the review useful
    no_help_num = scrapy.Field()   # how many found the review unhelpful
    people = scrapy.Field()        # reviewer
    people_url = scrapy.Field()    # reviewer's page
    star = scrapy.Field()          # rating
    comment = scrapy.Field()       # review text
    title = scrapy.Field()         # title
    comment_page_url = scrapy.Field()  # current page URL
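
Scrapy items behave like dicts whose keys are restricted to the declared fields; a quick standalone illustration (DemoItem is made up here; the sample values echo the book example in pipelines.py below):

import scrapy

class DemoItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

item = DemoItem(name='小王子')
item['price'] = '22.00元'
print(dict(item))       # {'name': '小王子', 'price': '22.00元'}
# item['author'] = 'x'  # would raise KeyError: 'author' is not a declared field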

pipelines.py

# -*- coding: utf-8 -*-

class DoubanBookPipeline(object):
    def process_item(self, item, spider):
        # content looks like: '[法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元'
        info = item['content'].split(' / ')
        item['price'] = info[-1]
        item['edition_year'] = info[-2]
        item['publisher'] = info[-3]
        return item


class DoubanMailPipeline(object):
    def process_item(self, item, spider):
        # strip spaces and newlines out of the title
        item['title'] = item['title'].replace(' ', '').replace('\n', '')
        return item


class DoubanMovieCommentPipeline(object):
    def process_item(self, item, spider):
        return item
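
The slicing in DoubanBookPipeline can be checked standalone on the sample string from its comment:

# the content field scraped for one book, as quoted in the pipeline comment
content = '[法] 圣埃克苏佩里 / 马振聘 / 人民文学出版社 / 2003-8 / 22.00元'
info = content.split(' / ')
print(info[-1])  # 22.00元 -> price
print(info[-2])  # 2003-8 -> edition_year
print(info[-3])  # 人民文学出版社 -> publisher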

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
from faker import Factory
f = Factory.create()
USER_AGENT = f.user_agent()

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Host': 'book.douban.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
}
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douban.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    #'douban.pipelines.DoubanBookPipeline': 300,
    #'douban.pipelines.DoubanMailPipeline': 600,
    'douban.pipelines.DoubanMovieCommentPipeline': 900,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
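
Only the movie-comment pipeline is enabled above; the book and mail pipelines are commented out and have to be swapped in by hand before running the other spiders. One way around that is Scrapy's per-spider custom_settings attribute, sketched here (not part of the course code):

import scrapy

class BookSpider(scrapy.Spider):
    # per-spider override: douban-book enables only its own pipeline,
    # regardless of what ITEM_PIPELINES in settings.py says
    name = 'douban-book'
    custom_settings = {
        'ITEM_PIPELINES': {'douban.pipelines.DoubanBookPipeline': 300},
    }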

scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = douban.settings

[deploy]
#url = http://localhost:6800/
project = douban

2. Douban image-crawling example: douban_imgs

Directory structure

douban_imgs/
    douban_imgs/
        spiders/
            __init__.py
            download_douban.py
        __init__.py
        items.py
        pipelines.py
        run_spider.py
        settings.py
    scrapy.cfg

spiders/__init__.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

download_douban.py

# coding=utf-8
from scrapy.spiders import Spider
from scrapy import Request
from douban_imgs.items import DoubanImgsItem


class download_douban(Spider):
    name = 'download_douban'

    default_headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'www.douban.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    }

    def __init__(self, url='1638835355', *args, **kwargs):
        # call the parent class's constructor first
        super(download_douban, self).__init__(*args, **kwargs)
        self.allowed_domains = ['douban.com']
        self.start_urls = ['http://www.douban.com/photos/album/%s/' % url]
        self.url = url

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, headers=self.default_headers, callback=self.parse)

    def parse(self, response):
        list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
        if list_imgs:
            item = DoubanImgsItem()
            item['image_urls'] = list_imgs
            yield item
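
The album id comes in through the constructor's url argument, so another album can be crawled without editing the spider, via the -a command-line option; a runner sketch (the id shown is just the spider's default):

from scrapy import cmdline
# equivalent to `scrapy crawl download_douban -a url=1638835355` in a shell;
# -a feeds keyword arguments to the spider's __init__
cmdline.execute('scrapy crawl download_douban -a url=1638835355'.split())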

__init__.py

(no code in this file)

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanImgsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()   # set by the spider; read by ImagesPipeline
    images = scrapy.Field()       # filled in by ImagesPipeline with download results
    image_paths = scrapy.Field()  # filled in by item_completed in pipelines.py

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
    def process_item(self, item, spider):
        return item


class DoubanImgDownloadPipeline(ImagesPipeline):
    default_headers = {
        'accept': 'image/webp,image/*,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, sdch, br',
        'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
        'cookie': 'bid=yQdC/AzTaCw',
        'referer': 'https://www.douban.com/photos/photo/2370443040/',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
    }

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            self.default_headers['referer'] = image_url
            yield Request(image_url, headers=self.default_headers)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
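
For reference, the results argument of item_completed is a list of (success, info) pairs, one per requested image; a sketch of its shape with made-up values (on failure the second element is a Twisted Failure, shown here as a plain Exception):

# illustrative only -- not real download results
results = [
    (True, {'url': 'https://img3.doubanio.com/view/photo/photo/p1.jpg',
            'path': 'full/0a1b2c3d.jpg',
            'checksum': 'b0974ea6c88740bed38266d1f24571fb'}),
    (False, Exception('download failed')),
]
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/0a1b2c3d.jpg']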

run_spider.py

from scrapy import cmdline

# hand the crawl command to Scrapy exactly as if it were typed in a shell
cmd_str = 'scrapy crawl download_douban'
cmdline.execute(cmd_str.split(' '))

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douban_imgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban_imgs'

SPIDER_MODULES = ['douban_imgs.spiders']
NEWSPIDER_MODULE = 'douban_imgs.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'douban_imgs (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN=16
# CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
# COOKIES_ENABLED=False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED=False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'douban_imgs.middlewares.MyCustomSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#     'douban_imgs.middlewares.MyCustomDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douban_imgs.pipelines.DoubanImgDownloadPipeline': 300,
}

IMAGES_STORE = 'D:\\doubanimgs'
#IMAGES_STORE = '/tmp'

# do not re-download images fetched within the last 90 days
IMAGES_EXPIRES = 90

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
# AUTOTHROTTLE_ENABLED=True
# The initial download delay
# AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED=True
# HTTPCACHE_EXPIRATION_SECS=0
# HTTPCACHE_DIR='httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES=[]
# HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
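
IMAGES_STORE is a plain filesystem path, hard-coded above to a Windows drive (with a /tmp alternative commented out). A portable variant, if you want one (an assumption, not part of the course settings):

import os
# store downloaded images under the user's home directory on any OS
IMAGES_STORE = os.path.join(os.path.expanduser('~'), 'doubanimgs')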

scrapy.cfg

  1. # Automatically created by: scrapy startproject
  2. #
  3. # For more information about the [deploy] section see:
  4. # https://scrapyd.readthedocs.org/en/latest/deploy.html
  5.  
  6. [settings]
  7. default = douban_imgs.settings
  8.  
  9. [deploy]
  10. #url = http://localhost:6800/
  11. project = douban_imgs
