Python Web Scraping with the requests Library

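Requests is an HTTP client library written in Python. It is more convenient than urlopen: it handles much of the intermediate processing for you, so page data can be fetched directly. Here is a concrete example: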

# Python 2.7
import requests

def request_function_try():
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    r=requests.get(url="http://www.baidu.com",headers=headers)
    print "status code:%s" % r.status_code
    print "headers:%s" % r.headers
    print "encoding:%s" % r.encoding
    print "cookies:%s" % r.cookies
    print "url:%s" % r.url
    # decode the UTF-8 body, then re-encode it for the Windows console ('mbcs' exists only on Windows)
    print r.content.decode('utf-8').encode('mbcs')

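requests.get() makes the HTTP request directly; here it is given the url and headers arguments. The return value is the page's response object, from which you can read the status code, the response headers, the encoding, the cookies, the final URL, and the page source.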

E:\python2.7.11\python.exe E:/py_prj/test3.py

status code:200

headers:{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Server': 'bfe/1.0.8.18', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:24 GMT', 'Connection': 'Keep-Alive', 'Pragma': 'no-cache', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Date': 'Sun, 17 Sep 2017 02:53:11 GMT', 'Content-Type': 'text/html'}

encoding:ISO-8859-1

cookies:{'.baidu.com': {'/': {'BDORZ': Cookie(version=0, name='BDORZ', value='27315', port=None, port_specified=False, domain='.baidu.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1505702637, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)}}}

url:http://www.baidu.com/

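Note that because the page contains Chinese text, printing the source directly in Python 2 runs into encoding problems, so the bytes have to be decoded first and then re-encoded. The target encoding here is mbcs (the Windows ANSI code page). The encoding to use on a given machine can be looked up as follows.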

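import sys
reload(sys)  # Python 2 only: restores setdefaultencoding, which is removed at interpreter startup
sys.setdefaultencoding('utf-8')
type = sys.getfilesystemencoding()  # note: this name shadows the built-in type()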
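The decode-then-encode step can be wrapped in a small helper. The following is a minimal sketch, not the author's code: the name to_console_bytes is hypothetical, and passing 'replace' as the error handler is an added assumption so that characters the console encoding cannot represent become '?' instead of raising an error. It runs on both Python 2.7 and Python 3.

```python
import sys

def to_console_bytes(raw_bytes, source_encoding='utf-8', target_encoding=None):
    """Decode raw_bytes, then re-encode them for the local console."""
    if target_encoding is None:
        # Same lookup the article uses; it returned 'mbcs' on the author's Windows machine.
        target_encoding = sys.getfilesystemencoding() or 'utf-8'
    text = raw_bytes.decode(source_encoding)
    # 'replace' substitutes '?' for characters the target encoding cannot represent.
    return text.encode(target_encoding, 'replace')

# Two Chinese characters, encoded as UTF-8 the way a page body would be.
page = u'\u767e\u5ea6'.encode('utf-8')
print(to_console_bytes(page, target_encoding='ascii'))
```

With a real response, the call would be to_console_bytes(r.content).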

requests also has a built-in JSON decoder that parses a JSON response body for you:

r=requests.get('https://github.com/timeline.json')
print r.json()

E:\python2.7.11\python.exe E:/py_prj/test3.py

{u'documentation_url': u'https://developer.github.com/v3/activity/events/#list-public-events', u'message': u'Hello there, wayfaring stranger. If you\u2019re reading this then you probably didn\u2019t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.'}
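When the body is not valid JSON, r.json() raises an exception that is a subclass of ValueError. A minimal sketch of guarding against that follows. The helper safe_json and the FakeResponse class are hypothetical; FakeResponse only stands in for a real response object so that the example runs without a network connection.

```python
import json

def safe_json(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        return None

class FakeResponse(object):
    """Hypothetical stand-in for a requests response, for offline use only."""
    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)

print(safe_json(FakeResponse('{"message": "Hello there"}')))
print(safe_json(FakeResponse('<html>not json</html>')))
```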

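What if we want to pass data along with the request? Take a Baidu search as an example: type python into the search box and fetch the results that come back.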

def request_function_try1():
    # Python 2.7; needs "import sys" and "import requests" at the top of the file
    reload(sys)
    sys.setdefaultencoding('utf-8')
    type = sys.getfilesystemencoding()
    print type
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
    payload={'wd':'python'}
    r=requests.get(url="http://www.baidu.com/s",params=payload,headers=headers)
    print r.status_code
    print r.content.decode('utf-8').encode(type)
    # r.content is a byte string: write it in one call, in binary mode
    with open('search2.html', 'wb') as fp:
        fp.write(r.content)

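Why is the URL http://www.baidu.com/s? Look at what the browser does: after python is typed into the search box, the page actually navigates to https://www.baidu.com/s, and wd=python and the other items that follow it are the submitted data.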
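To see how the params dictionary ends up in the URL, the request can be built without being sent. This is a small sketch using requests.Request and its prepare() method; no network access is involved.

```python
import requests

payload = {'wd': 'python'}

# Build the request object and prepare it, but do not send it.
prepared = requests.Request('GET', 'http://www.baidu.com/s', params=payload).prepare()

# The params are URL-encoded and appended as the query string.
print(prepared.url)
```

The printed URL has the same shape as the one the browser navigates to: the path /s followed by the query string wd=python.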

The result is as follows:

status code:200

headers:{'Strict-Transport-Security': 'max-age=172800', 'Bdqid': '0xeb453e0b0000947a', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': 'BDSVRTM=0; path=/, BD_HOME=0; path=/, H_PS_PSSID=1421_21078_17001_24394; path=/; domain=.baidu.com', 'Expires': 'Sun, 17 Sep 2017 02:56:13 GMT', 'Bduserid': '0', 'X-Powered-By': 'HPHP', 'Server': 'BWS/1.1', 'Connection': 'Keep-Alive', 'Cxy_all': 'baidu+2455763ad13223918d1e7f7431d4d18e', 'Cache-Control': 'private', 'Date': 'Sun, 17 Sep 2017 02:56:43 GMT', 'Vary': 'Accept-Encoding', 'Content-Type': 'text/html; charset=utf-8', 'Bdpagetype': '1', 'X-Ua-Compatible': 'IE=Edge,chrome=1'}

encoding:utf-8

cookies:<RequestsCookieJar[<Cookie H_PS_PSSID=1421_21078_17001_24394 for .baidu.com/>, <Cookie BDSVRTM=0 for www.baidu.com/>, <Cookie BD_HOME=0 for www.baidu.com/>]>

url:https://www.baidu.com/

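What if the site returns a status code other than 200? For this case requests provides raise_for_status(), which raises an HTTPError when the response status is a 4xx or 5xx error.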

url='http://www.baidubaidu.com/'
try:
    r=requests.get(url)
    r.raise_for_status()
except requests.RequestException as e:
    print e

The result is shown below; the exception message includes the specific error code.

E:\python2.7.11\python.exe E:/py_prj/test3.py

409 Client Error: Conflict for url: http://www.baidubaidu.com/
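Which status codes actually trigger the exception can be checked offline. The sketch below builds Response objects by hand purely for illustration (real code gets them from requests.get()); the helper name check is hypothetical.

```python
import requests

def check(status_code):
    """Return None for a non-error status, or the HTTPError message otherwise."""
    response = requests.Response()  # built by hand so the sketch needs no network
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        return str(e)
    return None

for code in (200, 302, 404, 500):
    print("%s -> %s" % (code, check(code)))
```

Only the 4xx and 5xx codes produce a message; 200 and 302 pass through without an exception.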

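Next, let's look at how to access an HTTPS site, using CSDN as the example. To simulate a login, we first need to capture the page traffic and analyze it; here Fiddler is used for the capture.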

(1) Analyze the page redirects. First comes the login page, at https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn, followed by an automatic redirect to my.csdn.net.

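(2) Analyze the data the page submits. The right-hand pane shows what was actually sent: the upper box holds the request headers, and the lower box holds the headers of the server's response. We build our own request headers from the data in the upper box.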

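(3) The capture shows that the data is submitted with POST, so we need to see which fields are posted. Clicking WebForms lists the uploaded data, which includes the fields username, password, lt, execution, _eventId, and others. Note these fields down so they can be reconstructed in code.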

(4) The last step is to look at the request that loads the mycsdn page. It uses GET and sends only headers, so only the headers need to be constructed.

With the data flow analyzed, we can start writing the code:

First, build the headers. The most important one is User-Agent; if it is not set, the site blocks the request.

headers={'host':'passport.csdn.net','User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}
headers1={'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'}

Next, build the cookie values that go into the headers:

# dict keys must be unique: JSESSIONID was listed twice and only the last value takes effect, so the first entry is dropped
cookie={'uuid_tt_dd':'-411111111111119_20170926','JSESSIONID':'2222222222222220265C40D8A33CB.tomcat2',
        'UN':'XXXXX','UE':'xxxxx@163.com','BT':'334343481','LSSC':'LSSC-145514-7aaaaaaaaaaazgGmhFvHfO9taaaaaaaR-passport.csdn.net',
        'Hm_lvt_6bcd52f51bbbbbb2bec4a3997715ac':'15044213,150656493,15064444445,1534488843','Hm_lpvt_6bcd52f51bbbbbbbe32bec4a3997715ac':'1506388843',
        'dc_tos':'oabckz','dc_session_id':'15063aaaa027_0.7098840409889817','__message_sys_msg_id':'0','__message_gu_msg_id':'0','__message_cnel_msg_id':'0','__message_district_code':'000000','__message_in_school':'0'}

Then set the url and the data to POST:

url='https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
data={'username':'xxxx','password':'xxxxx','lt':'LT-1522220-BSnH9fN6ycbbbbbqgsSP2waaa1jvq','execution':'e4ab','_eventId':'submit'}

Now make the connection. A Session is used so that all later requests share the same session state, such as cookie values:

r=requests.Session()
r.post(url=url,headers=headers,cookies=cookie,data=data)
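What "sharing the same session" means can be seen by letting a Session prepare a request without sending it. The following is a sketch under stated assumptions: the User-Agent string and the cookie value are placeholders, and prepare_request() is used only to inspect what would be sent.

```python
import requests

session = requests.Session()

# Defaults stored on the session apply to every request made through it.
session.headers.update({'User-Agent': 'Mozilla/5.0 (placeholder)'})
session.cookies.set('UN', 'XXXXX', domain='passport.csdn.net', path='/')

# prepare_request() merges the session's headers and cookies into the request.
request = requests.Request('GET', 'https://passport.csdn.net/account/login')
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))
```

Cookies that a server sets in its responses are stored in session.cookies in the same way, so they are sent back on later requests made through that Session.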

This step fails. The error returned is shown below and reports certificate verify failed:

File "E:\python2.7.11\lib\site-packages\requests\adapters.py", line 506, in send

    raise SSLError(e, request=request)

requests.exceptions.SSLError: HTTPSConnectionPool(host='passport.csdn.net', port=443): Max retries exceeded with url: /account/login?from=http://my.csdn.net/my/mycsdn (Caused by SSLError(SSLError(1, u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)'),))

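The background to this error is a feature introduced in Python 2.7.9: opening an https URL with urllib.urlopen now verifies the SSL certificate. When the target uses a self-signed certificate, this produces the error message urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>. requests likewise verifies certificates by default, which is why the POST above raised SSLError.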

On how to work around this, PEP 476 says:

For users who wish to opt out of certificate verification on a single connection, they can achieve this by providing the context argument to urllib.urlopen

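In other words, the certificate check can be turned off. For urllib there are two ways to do this; one is to pass urllib.urlopen() the context argument, set to the result of ssl._create_unverified_context():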

import ssl
import urllib

context = ssl._create_unverified_context()
urllib.urlopen("https://no-valid-cert", context=context)
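The difference between the two kinds of context can be inspected directly with the standard ssl module. This short sketch only compares their settings and opens no connection.

```python
import ssl

default_ctx = ssl.create_default_context()         # verifies certificates and host names
unverified_ctx = ssl._create_unverified_context()  # the opt-out described in PEP 476

print(default_ctx.verify_mode == ssl.CERT_REQUIRED)
print(default_ctx.check_hostname)
print(unverified_ctx.verify_mode == ssl.CERT_NONE)
print(unverified_ctx.check_hostname)
```

The unverified context accepts any certificate, including a self-signed one, because it neither validates the certificate chain nor checks the host name.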

In requests, though, there is a verify parameter; setting it to False turns certificate verification off for that request:

r.post(url=url,headers=headers,cookies=cookie,data=data,verify=False)
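Instead of passing verify on every call, the setting can also be stored on the Session object through its verify attribute. A minimal sketch follows; the helper name make_session is hypothetical.

```python
import requests

def make_session(verify=True):
    """Return a Session whose requests all use the given verify setting."""
    session = requests.Session()
    # True (the default) checks certificates; False skips the check;
    # a string is treated as the path to a CA bundle file.
    session.verify = verify
    return session

print(make_session().verify)
print(make_session(verify=False).verify)
```

Pointing verify at a CA bundle that contains the site's certificate keeps verification on while still accepting a certificate the default bundle does not trust.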

Next, request the mycsdn address. With that, the login to the CSDN site has succeeded:

s=r.get('http://my.csdn.net/my/mycsdn',headers=headers1)
print s.status_code
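# look the console encoding up here, so the line does not depend on a "type" variable set elsewhere
print s.content.decode('utf-8').encode(sys.getfilesystemencoding())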
