The following case study shows how to crawl Shenzhen rental listings from Anjuke. The strategy has two stages: first collect the URL of every rental listing, then visit each collected URL and extract the fields we need from the detail page. Be aware that after enough requests Anjuke redirects you to a verification-code (captcha) page; several ways of dealing with that are covered later.
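Even before adding any counter-measures, it helps to at least notice when that redirect happens so the crawl does not silently collect junk. A minimal sketch for either spider's parse() callback; the 'verify' URL marker is an assumption, so check what your own blocked requests actually redirect to before relying on it:

    def parse(self, response):
        # Bail out early if Anjuke bounced us to the verification page
        # ('verify' as a URL marker is an assumption -- inspect a real blocked
        #  request to confirm the pattern)
        if 'verify' in response.url:
            self.logger.warning('Hit the captcha page: %s' % response.url)
            return
        # ... normal parsing continues here ...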

If you do not know how to install and set up Scrapy yet, see my other post 《快速部署网络爬虫框架scrapy》 (on quickly deploying the Scrapy crawler framework).

1. Create the projects:

  From your working directory, run scrapy startproject anjuke_urls

  From the same directory, run scrapy startproject anjuke_zufang

2. Create the spider files:

  cd into the anjuke_urls project's spiders directory and run scrapy genspider anjuke_urls sz.zu.anjuke.com (the generated spider's name is later changed to anjuke_getUrls, as shown below)

  cd into the anjuke_zufang project's spiders directory and run scrapy genspider anjuke_zufang sz.zu.anjuke.com

3. Spider code for anjuke_urls:

  anjuke_urls.py

# -*- coding: utf-8 -*-
import scrapy

from ..items import AnjukeUrlsItem


class AnjukeGeturlsSpider(scrapy.Spider):
    name = 'anjuke_getUrls'
    start_urls = ['https://sz.zu.anjuke.com/']

    def parse(self, response):
        # Instantiate the item that carries each link
        mylink = AnjukeUrlsItem()
        # Extract every rental-listing link on the current page
        links = response.xpath("//div[@class='zu-itemmod']/a/@href | "
                               "//div[@class='zu-itemmod ']/a/@href").extract()
        for link in links:
            mylink['url'] = link
            yield mylink
        # If the "next page" button exists, follow it and repeat until every link is collected
        if len(response.xpath("//a[@class = 'aNxt']")) != 0:
            yield scrapy.Request(response.xpath("//a[@class = 'aNxt']/@href").extract()[0],
                                 callback=self.parse)
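
Before launching the whole crawl it is worth checking the two XPaths against a live page; a quick scrapy shell session works well for that. A sketch only: the selectors are the ones used above and will break if Anjuke changes its markup.

    scrapy shell "https://sz.zu.anjuke.com/"
    >>> response.xpath("//div[@class='zu-itemmod']/a/@href").extract()[:3]   # first few listing links
    >>> response.xpath("//a[@class = 'aNxt']/@href").extract_first()         # next-page link, or None on the last page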

  items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class AnjukeUrlsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # rental listing URL
    url = scrapy.Field()

  pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class AnjukeUrlsPipeline(object):
    # Open link.txt for writing (the file is created if it does not exist)
    def open_spider(self, spider):
        self.linkFile = open(r'G:\Python\网络爬虫\anjuke\data\link.txt', 'w', encoding='utf-8')

    # Write every collected URL to the file
    def process_item(self, item, spider):
        self.linkFile.writelines(item['url'] + "\n")
        return item

    # Close the file
    def close_spider(self, spider):
        self.linkFile.close()
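
For a one-field item like this you could also skip the custom pipeline and let Scrapy's built-in feed export dump the links, for example:

    scrapy crawl anjuke_getUrls -o links.csv

The hand-written pipeline is kept here only because the second spider expects a plain link.txt with one URL per line.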

  settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for anjuke_urls project
# See http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'anjuke_urls'

SPIDER_MODULES = ['anjuke_urls.spiders']
NEWSPIDER_MODULE = 'anjuke_urls.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'anjuke_urls.pipelines.AnjukeUrlsPipeline': 300,
}

# Everything else generated by `scrapy startproject` (concurrency limits, cookies,
# AutoThrottle, HTTP cache, middleware hooks, ...) is left commented out at its default.
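
Because the verification-code redirect mentioned at the start is triggered by request volume, the simplest mitigation lives in this same file: slow the crawl down. A sketch of settings that would do that; the exact values are guesses to be tuned against your own runs:

    DOWNLOAD_DELAY = 2                  # wait 2 s between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x-1.5x) so requests look less mechanical
    AUTOTHROTTLE_ENABLED = True         # let Scrapy back off automatically when the site slows down
    AUTOTHROTTLE_START_DELAY = 3
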
  middlewares.py
This file is exactly the stub generated by scrapy startproject: a default AnjukeUrlsSpiderMiddleware class whose from_crawler, process_spider_input, process_spider_output, process_spider_exception, process_start_requests and spider_opened methods are all unmodified template code. Nothing in it is customized for this project, so the full listing is omitted.
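
If you later want to rotate User-Agents, another common way to postpone the verification-code redirect, this is the file where such a downloader middleware would go. A minimal sketch, not part of the original project; the class name and UA list are made up for illustration, and the middleware still has to be enabled in DOWNLOADER_MIDDLEWARES:

    import random

    # A couple of desktop UA strings; extend this list for real use (illustrative values)
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    ]


    class RandomUserAgentMiddleware(object):
        # Downloader middleware: stamp a random User-Agent onto every outgoing request
        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(USER_AGENTS)
            return None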

4. Spider code for anjuke_zufang:

  anjuke_zufang.py

# -*- coding: utf-8 -*-
import scrapy

from ..items import AnjukeZufangItem


class AnjukeZufangSpider(scrapy.Spider):
    name = 'anjuke_zufang'
    # Start with an empty list; it is filled from link.txt in __init__
    start_urls = []
    custom_settings = {'DOWNLOAD_DELAY': 3}

    # Populate start_urls from the file written by the first spider -- this step is required
    def __init__(self):
        super(AnjukeZufangSpider, self).__init__()
        links = open(r'G:\Python\网络爬虫\anjuke\data\link.txt', encoding='utf-8')
        for line in links:
            # Strip the trailing newline, otherwise the URL cannot be requested
            line = line[:-1]
            self.start_urls.append(line)
        links.close()

    def parse(self, response):
        item = AnjukeZufangItem()
        # Pull each field we need straight off the detail page
        item['roomRent'] = response.xpath('//span[@class = "f26"]/text()').extract()[0]
        item['rentMode'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[2]/dd/text()').extract()[0].strip()
        item['roomLayout'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[3]/dd/text()').extract()[0].strip()
        item['roomSize'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[3]/dd/text()').extract()[0]
        item['LeaseMode'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[4]/dd/text()').extract()[0]
        item['apartmentName'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[5]/dd/a/text()').extract()[0]
        item['location1'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[6]/dd/a/text()').extract()[0]
        item['location2'] = response.xpath('//div[@class="pinfo"]/div/div/div[1]/dl[6]/dd/a[2]/text()').extract()[0]
        item['floor'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[5]/dd/text()').extract()[0]
        item['orientation'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[4]/dd/text()').extract()[0].strip()
        item['decorationSituation'] = response.xpath('//div[@class="pinfo"]/div/div/div[2]/dl[2]/dd/text()').extract()[0]
        item['intermediaryName'] = response.xpath('//h2[@class="f16"]/text()').extract()[0]
        item['intermediaryPhone'] = response.xpath('//p[@class="broker-mobile"]/text()').extract()[0]
        item['intermediaryCompany'] = response.xpath('//div[@class="broker-company"]/p[1]/a/text()').extract()[0]
        item['intermediaryStore'] = response.xpath('//div[@class="broker-company"]/p[2]/a/text()').extract()[0]
        item['link'] = response.url

        yield item
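
One fragility worth knowing about: extract()[0] raises IndexError the moment a listing is missing any of these fields, which kills that callback. A more forgiving sketch uses extract_first with a default; the XPaths are the same as above and the helper name is my own:

    def first(response, xpath, default=''):
        # Return the first XPath match stripped of whitespace, or a default when the node is missing
        value = response.xpath(xpath).extract_first()
        return value.strip() if value else default

    # inside parse(), for example:
    # item['roomRent'] = first(response, '//span[@class = "f26"]/text()')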

  items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class AnjukeZufangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # monthly rent
    roomRent = scrapy.Field()
    # deposit and payment terms
    rentMode = scrapy.Field()
    # room layout
    roomLayout = scrapy.Field()
    # floor area
    roomSize = scrapy.Field()
    # lease type (whole flat vs. shared)
    LeaseMode = scrapy.Field()
    # residential compound
    apartmentName = scrapy.Field()
    # location (district / neighbourhood)
    location1 = scrapy.Field()
    location2 = scrapy.Field()
    # floor
    floor = scrapy.Field()
    # orientation
    orientation = scrapy.Field()
    # decoration / furnishing
    decorationSituation = scrapy.Field()
    # agent name
    intermediaryName = scrapy.Field()
    # agent phone number
    intermediaryPhone = scrapy.Field()
    # agency company
    intermediaryCompany = scrapy.Field()
    # agency branch
    intermediaryStore = scrapy.Field()
    # listing URL
    link = scrapy.Field()

  pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html


class AnjukeZufangPipeline(object):
    def open_spider(self, spider):
        self.file = open(r'G:\Python\网络爬虫\anjuke\data\租房信息.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one comma-separated line per listing
        self.file.write(
            item['roomRent'] + "," + item['rentMode'] + "," + item['roomLayout'] + "," + item['roomSize'] + "," +
            item['LeaseMode'] + "," + item['apartmentName'] + "," + item['location1'] + " " + item['location2'] + "," +
            item['floor'] + "," + item['orientation'] + "," + item['decorationSituation'] + "," +
            item['intermediaryName'] + "," + item['intermediaryPhone'] + "," + item['intermediaryCompany'] + "," +
            item['intermediaryStore'] + "," + item['link'] + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()
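
Because some of these fields can themselves contain commas, joining them with "," by hand makes the output hard to parse back. A sketch of the same pipeline built on the csv module instead; the output path and class name are placeholders:

    import csv


    class AnjukeZufangCsvPipeline(object):
        # Field order for the CSV header and rows
        FIELDS = ['roomRent', 'rentMode', 'roomLayout', 'roomSize', 'LeaseMode',
                  'apartmentName', 'location1', 'location2', 'floor', 'orientation',
                  'decorationSituation', 'intermediaryName', 'intermediaryPhone',
                  'intermediaryCompany', 'intermediaryStore', 'link']

        def open_spider(self, spider):
            self.file = open('zufang.csv', 'w', encoding='utf-8', newline='')
            self.writer = csv.writer(self.file)
            self.writer.writerow(self.FIELDS)

        def process_item(self, item, spider):
            # Missing fields become empty cells instead of crashing the write
            self.writer.writerow([item.get(field, '') for field in self.FIELDS])
            return item

        def close_spider(self, spider):
            self.file.close()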

  settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for anjuke_zufang project
# See http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'anjuke_zufang'

SPIDER_MODULES = ['anjuke_zufang.spiders']
NEWSPIDER_MODULE = 'anjuke_zufang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'anjuke_zufang.pipelines.AnjukeZufangPipeline': 300,
}

# As in the first project, the remaining settings generated by `scrapy startproject`
# are left commented out at their defaults (the per-spider DOWNLOAD_DELAY is set via
# custom_settings in the spider itself).
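
If you add a User-Agent-rotating middleware like the sketch shown for the first project to this project's middlewares.py, this file is where it gets switched on. A sketch assuming that hypothetical class name:

    DOWNLOADER_MIDDLEWARES = {
        'anjuke_zufang.middlewares.RandomUserAgentMiddleware': 543,
    }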

  middlewares.py

  

As in the first project, this file is the untouched scrapy startproject template (here the default AnjukeZufangSpiderMiddleware stub); it contains no project-specific code, so the listing is omitted.

5. Run the spiders, in order:

  In the anjuke_urls project, run scrapy crawl anjuke_getUrls

  Then, in the anjuke_zufang project, run scrapy crawl anjuke_zufang
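
Put together, and keeping the data paths used in the pipelines above, a complete run looks roughly like this (the project directories are whatever you created in step 1):

    cd anjuke_urls
    scrapy crawl anjuke_getUrls      # first pass: collects every listing URL into link.txt
    cd ..\anjuke_zufang
    scrapy crawl anjuke_zufang       # second pass: reads link.txt and writes the rental details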
