Test environment:

Win10, single-machine crawl, Scrapy 1.5.0, Python 3.6.4, MongoDB, Robo 3T

Other preparation:

Proxy pool: for this test I did not use the Flask-based proxy pool, because the few free proxy sites I found did not have enough working IPs. Instead I batch-fetched 800+ free HTTPS proxies from the xxx site, tested them against 58.com with a thread pool, and saved the working IPs to a JSON file. A proxy middleware in the Scrapy project then picks a random proxy from that JSON for every request;
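The validation step can be sketched roughly as below. This is a sketch, not the original script: the file names, the probe URL, the timeout, and the pool size are illustrative assumptions, and the stdlib urllib stands in for whatever HTTP client was actually used.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TEST_URL = 'https://bj.58.com/'  # probe target: the site we intend to crawl

def check_proxy(proxy, timeout=5):
    """Return True if TEST_URL answers with 200 through this HTTPS proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({'https': 'https://' + proxy}))
    try:
        return opener.open(TEST_URL, timeout=timeout).status == 200
    except OSError:
        return False

def filter_valid(proxies, checker=check_proxy, workers=50):
    """Probe all candidate proxies concurrently and keep only the live ones."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(checker, proxies))
    return [p for p, ok in zip(proxies, flags) if ok]

if __name__ == '__main__':
    with open('raw_proxies.json') as f:          # the 800+ scraped candidates
        candidates = json.load(f)
    with open('valid_proxies.json', 'w') as f:   # the working subset
        json.dump(filter_valid(candidates), f)
```

Injecting the checker as a parameter keeps the concurrency logic testable without touching the network.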

Request headers: I collected User-Agent strings for various sites from around the web and added a UserAgent middleware in Scrapy that picks a random one for every request;
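Both middlewares are straightforward; a minimal sketch follows. The file name and the UA entries are illustrative assumptions, and the classes only touch the request's meta/headers interface, so nothing Scrapy-specific needs importing here.

```python
import json
import random

class MyProxyMiddleWare:
    """Attach a random proxy from the validated-proxy JSON to every request."""

    def __init__(self, proxy_file='valid_proxies.json'):
        with open(proxy_file) as f:
            self.proxies = json.load(f)  # e.g. ["104.236.248.219:3128", ...]

    def process_request(self, request, spider):
        request.meta['proxy'] = 'https://' + random.choice(self.proxies)

class MyUserAgentMiddleWare:
    """Attach a random User-Agent from the collected pool to every request."""

    def __init__(self, agents=None):
        self.agents = agents or [
            # illustrative entries; the real pool was scraped from the web
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)',
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.agents)
```

In a real project these would usually read their settings via from_crawler; the constructor arguments here are just to keep the sketch self-contained.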

settings.py:

BOT_NAME = 'oldHouse'
SPIDER_MODULES = ['oldHouse.spiders']
NEWSPIDER_MODULE = 'oldHouse.spiders'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
RETRY_TIMES = 8
MONGO_URI = 'localhost'
MONGO_DATABASE = 'old58House'
ITEM_PIPELINES = {
    'oldHouse.pipelines.MongoPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'oldHouse.middlewares.OldhouseDownloaderMiddleware': 543,
    'oldHouse.middlewares.MyProxyMiddleWare': 542,
    'oldHouse.middlewares.MyUserAgentMiddleWare': 541,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'oldHouse.middlewares.MyRedirectMiddleware': 601,
    'oldHouse.middlewares.MyRetryMiddleware': 551,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

In all of the analysis below:

real_url: a correct listing URL as given by 58.com, e.g. https://bj.58.com/ershoufang/37786966127392x.shtml

fake_url: a 58.com URL containing 'zd_p', which we must follow through redirects to reach a real_url, e.g. https://short.58.com/zd_p/887076ce-1bfa-4142-ae0f-59c079a078e9/

jump_url: the URL a fake_url redirects to, the bridge for reaching the real_url (the long https://jing.58.com/adJump?adType=3&target=... URLs in the logs below)

firewall: a verification URL on 58's servers, e.g. GET https://callback.58.com/firewall/verifycode?......

I. Scenarios that appeared while crawling:

1) real_url -> firewall -> firewall -> firewall -> too many retries, request dies. Given a correct URL, frequent access from one IP redirects us to 58's verification URL; since I wrote no code to solve the verification, the URL is abandoned after a couple of retries.

Example:
2019-04-16 14:19:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://callback.58.com/firewall/verifycode?serialId=5167d73136b2b181a1f31897773da5fa_df9c5d69d8f64ab7acbd93658f644092&code=22&sign=9
0346b3cf6733d799b204c2fdb508612&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37786966127392x.shtml> from <GET https://bj.58.com/ershoufang/37786966127392x.shtml> 2019-04-16 14:19:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://callback.58.com/firewall/verifycode?serialId=5167d73136b2b181a1f31897773da5fa_df9c5d69d8f64ab7acbd93658f644092&code=22&sign=90346b3cf6733d79
9b204c2fdb508612&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37786966127392x.shtml> (failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.

2) real_url -> firewall -> firewall, the firewall page is fetched -> since the wrong page was fetched, data extraction fails with a NoneType error.

Example:
2019-04-16 14:18:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://callback.58.com/firewall/verifycode?serialId=fa0b4cbd0ad45dfd70b236d523d35fe4_4766f82648964a8190d624a446194d0b&code=22&sign=3
6be5e04f16ed03203be421da14859a9&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37785831004063x.shtml> from <GET https://bj.58.com/ershoufang/37785831004063x.shtml> 2019-04-16 14:18:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://callback.58.com/firewall/verifycode?serialId=fa0b4cbd0ad45dfd70b236d523d35fe4_4766f82648964a8190d624a446194d0b&code=22&sign=36be5e04f16ed03203be421da14
859a9&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37785831004063x.shtml> (referer: https://bj.58.com/ershoufang/) 2019-04-16 14:18:52 [scrapy.core.scraper] ERROR: Spider error processing <GET https://callback.58.com/firewall/verifycode?serialId=fa0b4cbd0ad45dfd70b236d523d35fe4_4766f82648964a8190d624a446194d0b&code=22&sign=36be5e04f16ed032
03be421da14859a9&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37785831004063x.shtml> (referer: https://bj.58.com/ershoufang/)
Traceback (most recent call last):

3) fake_url -> jump_url -> jump_url -> jump_url, URL abandoned.

Example:
2019-04-16 16:24:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1mdnjbYrjEdnHDknL980v6YUyk_uaYYm191nH-hPiYvnWmYsH
whrHNVryF6nBdWmWFBmWb3mvNLuAn_nHDQP1bOnWDYnHcLP1DQPjnvrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vPHTOPj9YPHDQnjnhIgP-0h-b5HmQnHmOnHn1nHnYPWDQFh-VuybqFhR8IA-YXgwO0ANqnau-
UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEkn1T1PWckPaukULPGIA-fUWYzriuWUA-Wpv-b5H9OnWnkPhcOsHNYrHDVPAPBuid6mHFWsH9QuyNYuy7bnvw-raukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5H
D_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmyDQnyNOP1bVuW9QPaYYPAEQsHbQm1bVuHNOmvDdmWb3rymQ> from <GET https://short.58.com/zd_p/892306b9-5491-4cbe-aa2c-81ee4ead3de8/> 2019-04-16 16:24:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1mdnjbYrjEdnHDknL980v6YUyk_uaYYm191nH-hPiYvnWmYsHwhrHNVryF6nBdWm
WFBmWb3mvNLuAn_nHDQP1bOnWDYnHcLP1DQPjnvrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vPHTOPj9YPHDQnjnhIgP-0h-b5HmQnHmOnHn1nHnYPWDQFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHY
huyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEkn1T1PWckPaukULPGIA-fUWYzriuWUA-Wpv-b5H9OnWnkPhcOsHNYrHDVPAPBuid6mHFWsH9QuyNYuy7bnvw-raukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWI
A-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmyDQnyNOP1bVuW9QPaYYPAEQsHbQm1bVuHNOmvDdmWb3rymQ> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. 2019-04-16 16:24:59 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1mdnjbYrjEdnHDknL980v6YUyk_uaYYm191nH-hPiYvnWmYsHwhrHNVryF6nBdWm
WFBmWb3mvNLuAn_nHDQP1bOnWDYnHcLP1DQPjnvrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vPHTOPj9YPHDQnjnhIgP-0h-b5HmQnHmOnHn1nHnYPWDQFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHY
huyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEkn1T1PWckPaukULPGIA-fUWYzriuWUA-Wpv-b5H9OnWnkPhcOsHNYrHDVPAPBuid6mHFWsH9QuyNYuy7bnvw-raukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWI
A-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmyDQnyNOP1bVuW9QPaYYPAEQsHbQm1bVuHNOmvDdmWb3rymQ> (failed 2 times): Could not open CONNECT tunnel with proxy 104.236.248.219:3128 [{'status': 503, 'reason': b'Service Unavailable'}] 2019-04-16 16:25:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1mdnjbYrjEdnHDknL980v6YUyk_uaYYm191nH-hPiYvnWmYsHwhrHNVryF6nBdWm
WFBmWb3mvNLuAn_nHDQP1bOnWDYnHcLP1DQPjnvrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vPHTOPj9YPHDQnjnhIgP-0h-b5HmQnHmOnHn1nHnYPWDQFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHY
huyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEkn1T1PWckPaukULPGIA-fUWYzriuWUA-Wpv-b5H9OnWnkPhcOsHNYrHDVPAPBuid6mHFWsH9QuyNYuy7bnvw-raukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWI
A-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmyDQnyNOP1bVuW9QPaYYPAEQsHbQm1bVuHNOmvDdmWb3rymQ> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

4) fake_url -> jump_url -> real_url -> firewall: we finally obtain the real_url, only to hit the wall again because of frequent requests.

Example:
2019-04-16 14:19:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P10LP1DLP10znHcYnM980v6YUyk_uadhnAn1nhFhnaY3nH6-sH
wWuAnVmWEvriYzmHP6PvwWuyRWmhn_nHDQP10zrjnzPW0QnHTknHT3rak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10LP10QP10LnWDzPjchIgP-0h-b5HDzrHDvrjnOrjbzPj9vFh-VuybqFhR8IA-YXgwO0ANqnau-
UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHnOPHNznjbdnBukULPGIA-fUWY1rauWUA-Wpv-b5H93P1TLPhP-sH7BuhDVPjDYnBd6uHKhsHNOm1TLryDkP16-riukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5H
D_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqnAEOnyPWrynVrAN3PaYYmvnksyDvuhcVrHTvPjP-m1czPWRh> from <GET https://short.58.com/zd_p/887076ce-1bfa-4142-ae0f-59c079a078e9/> 2019-04-16 14:19:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37777177721242x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draO
WUvYfugF1pAqduh78uzt1P10LP1DLP10znHcYnM980v6YUyk_uadhnAn1nhFhnaY3nH6-sHwWuAnVmWEvriYzmHP6PvwWuyRWmhn_nHDQP10zrjnzPW0QnHTknHT3rak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10LP
10QP10LnWDzPjchIgP-0h-b5HDzrHDvrjnOrjbzPj9vFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHnOPHNznjbdnBukULPGIA-fUWY1rauWUA-Wpv-b5H93P1TLPhP-sH7BuhDVPjDYnBd6uHKhsHNOm1TLryDkP16-riukmgF6UHYQnj0
LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqnAEOnyPWrynVrAN3PaYYmvnksyDvuhcVrHTvPjP-m1czPWRh> 2019-04-16 14:19:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://callback.58.com/firewall/verifycode?serialId=75196bd68f771f168bdbcaa7e8a97a6b_f35824ea81fa488aa5e974355cd785da&code=22&sign=7
1677e2c4c84c2db8421e233411db814&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37777177721242x.shtml%3Fadtype%3D3> from <GET https://bj.58.com/ershoufang/37777177721242x.shtml?adtype=3>

II. Solutions to the scenarios above

Overall approach:

Since crawling this data requires no login, we deal with 58's firewall and its hard-to-crack trajectory verification simply by switching to another IP;

The failed requests trace back to redirects and retries. For retries: my proxy pool's hit rate is not high and I am not maintaining live proxies via Flask, so I simply allow a larger RETRY_TIMES. For redirects: the logs above show redirects between every kind of URL, so a redirect middleware is unavoidable, and each type of redirect needs its own handling in process_response. The redirect problem is worked out in detail below.

The fix:

The four redirect patterns in brief:
1. real_url -> firewall -> firewall -> firewall -> too many retries, request dies.
Cause: requests were too frequent and redirects were enabled, so we ended up at the firewall instead of re-crawling the real_url.
2. real_url -> firewall -> firewall, firewall page fetched -> NoneType error during extraction because the wrong page was parsed.
Cause: the same as pattern 1.
3. fake_url -> jump_url -> jump_url -> jump_url, URL abandoned.
Most likely bad proxies caused the endless retries.
4. fake_url -> jump_url -> real_url -> firewall: the hard-won real_url hits the wall again because of frequent requests.
Even after the fake_url finally redirects to the real_url, frequent requests can still push it into the wall, i.e. back to pattern 1.

Tackling them one by one:
If simply setting REDIRECT_ENABLED=False in settings.py were enough, great, but it is not: as scenario 4 shows, a fake_url can hop, hop, hop all the way to the real_url we need. That chain of hops is exactly the trap 58.com set, and we have to follow it.

Solution for scenarios 1 and 2:
Once a real_url redirects to the firewall we are off track, so the first idea is to forbid that real_url from redirecting: set dont_redirect=True (default False) in scrapy.Request for real_urls. But that alone fails. The real_url is denied its redirect while still being handed a junk IP, so it slams into the wall, and nothing handles the aftermath: the redirect gets no follow-up processing, and the log shows:
2019-04-17 08:10:06 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://bj.58.com/ershoufang/> (referer: https://bj.58.com/ershoufang/)
2019-04-17 08:10:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 https://bj.58.com/ershoufang/>: HTTP status code is not handled or not allowed
2019-04-17 08:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
If this real_url had appeared later in the crawl it would have been survivable, but as shown above it was an initial URL, so the spider shut down without crawling a single page. Game over.
So how should it be handled? Abandon the per-request dont_redirect=True idea and keep the global default of following redirects. Instead, define a MyRedirectMiddleware that inherits the stock RedirectMiddleware behaviour in full and adds a detection hook at real_url -> firewall: catch the real_url and, before the redirect can be followed, return Request(real_url, ...). Two details remain: the real_url was already crawled once and is recorded in the dupefilter fingerprints, so remember to pass dont_filter=True; and remember to set callback=spider.parse_xxx.

Solution for scenario 3:
In the same MyRedirectMiddleware, add a hook at fake_url -> jump_url: whenever the redirect target is a jump_url, grant extra retries. With the proxy middleware in place, this is basically enough to guarantee we eventually reach the real_url.

Solution for scenario 4:
Again in MyRedirectMiddleware, add a hook at jump_url -> real_url: if the redirect target is neither a jump_url nor the firewall, we have almost certainly obtained a real_url, so simply let the redirect proceed.
After all that, the good news is that scenario 4 needs no extra work. The jump_url -> real_url leg is covered because redirects stay globally enabled and scenario 3 keeps retrying jump_urls, so the real_url is bound to be reached; and the real_url -> firewall leg is exactly what scenario 1 already solves. Scenario 4 resolves itself.
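The whole decision boils down to a tiny classification of the redirect target, mirroring the substring checks used in the middleware code:

```python
def classify_redirect(redirected_url):
    """Classify a 302 target the same way the middleware's substring checks do."""
    if 'firewall' in redirected_url:
        return 'firewall'  # scenarios 1/2: re-issue the original request instead
    if 'Jump' in redirected_url:
        return 'jump'      # scenario 3: follow it, with a reset retry budget
    return 'real'          # anything else: treat as a real listing URL

print(classify_redirect('https://callback.58.com/firewall/verifycode?code=22'))  # firewall
print(classify_redirect('https://jing.58.com/adJump?adType=3'))                  # jump
print(classify_redirect('https://bj.58.com/ershoufang/37786966127392x.shtml'))   # real
```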

The concrete code, excerpted from the redirect middleware:

# -*- coding:utf-8 -*-
# Author: Tarantiner
# @Time :2019/4/17 18:26
from urllib.parse import urljoin

from scrapy import Request
from scrapy.downloadermiddlewares.redirect import BaseRedirectMiddleware
from w3lib.url import safe_url_string


class MyRedirectMiddleware(BaseRedirectMiddleware):
    def process_response(self, request, response, spider):
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' not in response.headers or response.status not in allowed_status:
            return response

        location = safe_url_string(response.headers['location'])
        redirected_url = urljoin(request.url, location)

        if response.status in (301, 307, 308) or request.method == 'HEAD':
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if 'firewall' in redirected_url:
            # Guard against scenarios 1 and 2 (real_url -> firewall):
            # re-request the original URL instead of following the redirect.
            return Request(response.url, callback=spider.parse_detail, dont_filter=True)

        if 'Jump' in redirected_url:
            # Guard against scenario 3 (fake_url -> jump_url -> jump_url -> abandoned):
            # replacing meta resets the retry counter, so every time this jump URL
            # is met it gets a fresh retry budget, i.e. effectively unlimited retries.
            new_request = request.replace(url=redirected_url, method='GET', body='',
                                          meta={'max_retry_times': 12})
        else:
            new_request = self._redirect_request_using_get(request, redirected_url)
        return self._redirect(new_request, request, spider, response.status)
Crawl results after the fix:
Scenarios 1 and 2: real_url -> real_url -> 200, as below
redirected_url:
https://callback.58.com/firewall/verifycode?serialId=8b8b4a1ead5a3ded505d96dcc8e42004_21b60bb0e6194aeea99c0b42f0f99c2f&code=22&sign=f86bd444c70b93fc537503ef857276ec&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37688560543505x.shtml
response_url:
https://bj.58.com/ershoufang/37688560543505x.shtml
2019-04-17 18:49:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37610200172685x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-17 18:49:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37610200172685x.shtml?adtype=3>
As you can see, after real_url -> firewall the firewall page is never actually crawled; instead the real_url is fetched again and returns 200.

Scenario 3: fake_url -> jump_url -> jump_url -> real_url -> 200, as below
2019-04-16 22:43:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1N3P1mzPjT1PHDknL980v6YUyk_uaY3PH6bmHwbmiY3PhDdsH
wBnHnVrHnzridbuHckPjPbmHmvP1N_nHDQn1cLPH9dP101n1bdPHN3Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10drj0vnWEkn1NQnjnhIgP-0h-b5Hmkn10QrHTvn1NznHnLFh-VuybqFhR8IA-YXgwO0ANqnau-
UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYOFhP_pyPopyEqnAmzuW0QnjNVn10kPiYYryF-sH6brynVmvDYmH0QPHEvPyRhFMK60h7V5HDkP10lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqni
kQnzuk0hqbIyPYpyEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvuHR6P1PWuiYkuyNdsHELryEVrAmOmzYzP101rAmkrju6PjN> from <GET https://short.58.com/zd_p/0f2f7105-3705-49be-8d9c-ca4a715465ef/> 2019-04-16 22:43:59 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1N3P1mzPjT1PHDknL980v6YUyk_uaY3PH6bmHwbmiY3PhDdsHwBnHnVrHnzridbu
HckPjPbmHmvP1N_nHDQn1cLPH9dP101n1bdPHN3Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10drj0vnWEkn1NQnjnhIgP-0h-b5Hmkn10QrHTvn1NznHnLFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHY
huyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYOFhP_pyPopyEqnAmzuW0QnjNVn10kPiYYryF-sH6brynVmvDYmH0QPHEvPyRhFMK60h7V5HDkP10lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqnikQnzuk0hqbIyPYp
yEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvuHR6P1PWuiYkuyNdsHELryEVrAmOmzYzP101rAmkrju6PjN> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. 2019-04-16 22:44:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draO
WUvYfugF1pAqduh78uzt1P1N3P1mzPjT1PHDknL980v6YUyk_uaY3PH6bmHwbmiY3PhDdsHwBnHnVrHnzridbuHckPjPbmHmvP1N_nHDQn1cLPH9dP101n1bdPHN3Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10dr
j0vnWEkn1NQnjnhIgP-0h-b5Hmkn10QrHTvn1NznHnLFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYOFhP_pyPopyEqnAmzuW0QnjNVn10kPiYYryF-sH6brynVmvDYmH0QPHEvPyRhFMK60h7V5HDkP10
lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqnikQnzuk0hqbIyPYpyEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvuHR6P1PWuiYkuyNdsHELryEVrAmOmzYzP101rAmkrju6PjN> 2019-04-16 22:44:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-16 22:44:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3>

Scenario 4: fake_url -> jump_url -> real_url -> 200, as below
2019-04-16 22:43:33 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P10dPWTYPHT3n1T1nZ980v6YUyk_uaY3PH6bmHwbmiY3PhDdsH
wBnHnVrHnzridbuHckPjPbmHmvP1N_nHDQPWcvrjDkPj0vP1cdPjNzrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10LPHmkPjNkrjnkn1ThIgP-0h-b5HN3P1Dkn1EOPWT1rjN3Fh-VuybqFhR8IA-YXgwO0ANqnau-
UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYdFhP_pyPopyEqmW76rHwbrjEVujb1mBYYnW66sHb1PjcVmWELujN1nH-Bm193FMK60h7V5HDkP10lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqni
kQnzuk0hqbIyPYpyEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvnjPWnHbvmzY3P1-6sHEvrjTVrywbuBYzmWNOuHw-nAmzPj9> from <GET https://short.58.com/zd_p/b1a94d84-d93b-428a-9342-b47d5319bc88/> 2019-04-16 22:43:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draO
WUvYfugF1pAqduh78uzt1P10dPWTYPHT3n1T1nZ980v6YUyk_uaY3PH6bmHwbmiY3PhDdsHwBnHnVrHnzridbuHckPjPbmHmvP1N_nHDQPWcvrjDkPj0vP1cdPjNzrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10LP
HmkPjNkrjnkn1ThIgP-0h-b5HN3P1Dkn1EOPWT1rjN3Fh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYdFhP_pyPopyEqmW76rHwbrjEVujb1mBYYnW66sHb1PjcVmWELujN1nH-Bm193FMK60h7V5HDkP10
lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqnikQnzuk0hqbIyPYpyEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvnjPWnHbvmzY3P1-6sHEvrjTVrywbuBYzmWNOuHw-nAmzPj9> 2019-04-16 22:43:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-16 22:43:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3>

Better still, I captured one excellent log; watching the URL redirect exactly as designed is genuinely satisfying:

fake_url -> jump_url -> real_url -> retry 1 time -> retry 2 times -> firewall (not actually visited; a fresh Request is issued instead) -> real_url -> 200
2019-04-17 16:26:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1m3rj0OP1TOrjmdng980v6YUyk_uadbujNYmyEOuBYknhuWsH
EYPWNVrjmLmiYdmHKbmWTkmyDQmhc_nHDQrjEkPj9vn1m1rjmYPW03Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vrj9LrH0krH9vPHDhIgP-0h-b5HNLP1DLnHT3rj9Orj0YFh-VuybqFhR8IA-YXgwO0ANqnau-
UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHE3rHN3PH0vnaukULPGIA-fUWYQnauWUA-Wpv-b5HbkPWn1mHm1sHIhnjmVPj-6raY3rHFBsHT3njubnAPhP1bvuBukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5H
D_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmHnQnHFhnjNVnjuWuiYYmHFBsH--rHcVuyu-rAm3rjmzPWTL> from <GET https://short.58.com/zd_p/90633a63-7f06-49a8-892b-0806d0cf796f/> 2019-04-17 16:27:04 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draO
WUvYfugF1pAqduh78uzt1P1m3rj0OP1TOrjmdng980v6YUyk_uadbujNYmyEOuBYknhuWsHEYPWNVrjmLmiYdmHKbmWTkmyDQmhc_nHDQrjEkPj9vn1m1rjmYPW03Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vr
j9LrH0krH9vPHDhIgP-0h-b5HNLP1DLnHT3rj9Orj0YFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHE3rHN3PH0vnaukULPGIA-fUWYQnauWUA-Wpv-b5HbkPWn1mHm1sHIhnjmVPj-6raY3rHFBsHT3njubnAPhP1bvuBukmgF6UHYQnj0
LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmHnQnHFhnjNVnjuWuiYYmHFBsH--rHcVuyu-rAm3rjmzPWTL> 2019-04-17 16:27:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. 2019-04-17 16:28:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
redirect_url:
https://callback.58.com/firewall/verifycode?serialId=70e3ea25cb505bc3d0746bb61d508d53_6da701bcb6ca44fd92bbe820a73dca84&code=22&sign=cc2a1d287fa102f0f21d33d91b3c51ea&namespace=ershoufangphp&url=https%3A
%2F%2Fbj.58.com%2Fershoufang%2F37688797098651x.shtml%3Fadtype%3D3
response_url:
https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3
2019-04-17 16:28:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-17 16:28:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3>
This one crawl path passes through all four scenarios and still ends with the data scraped, so it should be fairly representative.

Stats from one of the runs:

{'downloader/exception_count': 136,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 16,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 11,
'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 76,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 30,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
'downloader/request_bytes': 384128,
'downloader/request_count': 750,
'downloader/request_method_count/GET': 750,
'downloader/response_bytes': 2385832,
'downloader/response_count': 614,
'downloader/response_status_count/200': 123,
'downloader/response_status_count/302': 490,
'downloader/response_status_count/504': 1,
'dupefilter/filtered': 122,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 17, 10, 52, 26, 392186),
'item_scraped_count': 122,
'log_count/DEBUG': 500,
'log_count/INFO': 27,
'log_count/WARNING': 2,
'request_depth_max': 1,
'response_received_count': 123,
'retry/count': 137,
'retry/reason_count/504 Gateway Time-out': 1,
'retry/reason_count/scrapy.core.downloader.handlers.http11.TunnelError': 16,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 11,
'retry/reason_count/twisted.internet.error.TCPTimedOutError': 76,
'retry/reason_count/twisted.internet.error.TimeoutError': 30,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 3,
'scheduler/dequeued': 750,
'scheduler/dequeued/memory': 750,
'scheduler/enqueued': 750,
'scheduler/enqueued/memory': 750,
'start_time': datetime.datetime(2019, 4, 17, 10, 33, 6, 936247)}
2019-04-17 18:52:26 [scrapy.core.engine] INFO: Spider closed (finished)

Clearly there is still plenty to improve; I will share my optimization ideas in a later post. O(∩_∩)O


