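2. Request libraries: requests

requests module contents:

Introduction
GET requests
POST requests
The Response object
Advanced usage

I. Introduction

#Introduction: requests lets you simulate browser requests. Compared with urllib, used earlier, its API is more convenient (it is essentially a wrapper around urllib3)

#Note: requests downloads the page content but does not execute JavaScript. You have to analyze the target site yourself and issue the extra requests that the page's scripts would have made

#Install: pip3 install requests

#Request methods: the most commonly used are requests.get() and requests.post()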

>>> import requests

>>> r = requests.get('https://api.github.com/events')

>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})

>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})

>>> r = requests.delete('http://httpbin.org/delete')

>>> r = requests.head('http://httpbin.org/get')

>>> r = requests.options('http://httpbin.org/get')

#Before studying requests in earnest, it helps to be familiar with the HTTP protocol

http://www.cnblogs.com/linhaifeng/p/6266327.html

II. GET requests

1. Basic request

import requests

response=requests.get('http://dig.chouti.com/')

print(response.text)

2. GET requests with parameters ->>> params

import requests

response=requests.get('https://www.baidu.com/s?wd=python&pn=0',

                      headers={

                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',

                      })

print(response.text)

#If the query keyword is Chinese or contains other special characters, it has to be URL-encoded

from urllib.parse import urlencode

wd='瞎驴'

encode_res=urlencode({'wd':wd},encoding='utf-8')

keyword=encode_res.split('=')[1]

print(keyword)

# then splice it into the URL

url='https://www.baidu.com/s?wd=%s&pn=0' %keyword

response=requests.get(url,

                      headers={

                        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',

                      })

res1=response.text

Building the GET parameters by hand

 #The steps above can be handled by the params argument of requests; under the hood it still calls urlencode

 from urllib.parse import urlencode

 wd='瞎驴老师'

 pn=0

 response=requests.get('https://www.baidu.com/s',

                       params={

                           'wd':wd,

                           'pn':pn

                       },

                       headers={

                         'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',

                       })

 res2=response.text

 #Check the result: a.html and b.html should contain the same page

 with open('a.html','w',encoding='utf-8') as f:

     f.write(res1)

 with open('b.html', 'w', encoding='utf-8') as f:

     f.write(res2)

Using the params parameter
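To see what params does without touching the network, one option is to prepare a request and inspect the URL it would use. This is a minimal sketch: it reuses the Baidu search URL and keyword from the example above, and `Request.prepare()` only builds the request, it does not send it.

```python
import requests
from urllib.parse import urlsplit, parse_qs

# Build (but do not send) a GET request with a non-ASCII query value.
req = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': '瞎驴老师', 'pn': 0},
)
prepared = req.prepare()

# The query string has been percent-encoded as UTF-8.
print(prepared.url)

# Decoding the query string gives back the original values (as strings).
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

The hand-rolled `encode_res.split('=')[1]` step is not needed once params is used.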

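3. GET requests with parameters ->>> headers

 #Requests usually need to carry request headers; they are the key to passing yourself off as a browser. Commonly useful headers:

 Host

 Referer #large sites often use this header to judge where a request came from

 User-Agent #identifies the client

 Cookie #cookies are part of the request headers, but requests has a dedicated cookies parameter for them, so pass them via cookies= rather than inside headers={}

 #Adding headers (the server inspects the request headers; without them access may be refused, e.g. https://www.zhihu.com/explore)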

 import requests

 response=requests.get('https://www.zhihu.com/explore')

 response.status_code #

 #Custom headers

 headers={

     'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',

 }

 response=requests.get('https://www.zhihu.com/explore',

                       headers=headers)

 print(response.status_code) #

Request header example
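It can help to look at what requests sends when no headers are given, since that default User-Agent is what some sites reject. The sketch below stays offline by preparing a request through a session instead of sending it; the browser string is a made-up placeholder.

```python
import requests

# The headers requests sends when none are supplied.
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # identifies the client as python-requests

# A session merges per-request headers over its defaults.
session = requests.Session()
req = requests.Request(
    'GET',
    'https://www.zhihu.com/explore',
    headers={'User-Agent': 'Mozilla/5.0 (hypothetical browser string)'},
)
prepared = session.prepare_request(req)

print(prepared.headers['User-Agent'])       # the custom value wins
print(prepared.headers['Accept-Encoding'])  # session default, still present
```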

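4. GET requests with parameters ->>> cookies

 #Log in to GitHub, copy the cookies from the browser, and from then on you can use the cookie directly without entering a username and password

 #Username: egonlin  Email: 378533872@qq.com  Password: lhf@123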

 import requests

 Cookies={   'user_session':'wGMHFJKgDcmRIVvcA14_Wrt_3xaUyJNsBnPbYzEL6L0bHcfc',

 }

 response=requests.get('https://github.com/settings/emails',

              cookies=Cookies) #GitHub places few restrictions on request headers, so no custom User-Agent is needed here; other sites may require one

 print('378533872@qq.com' in response.text) #True

Visiting a login-only page with a cookie that already carries the user's session
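Browsers show cookies as a single `name=value; name2=value2` string, while the cookies parameter expects a dict. A small helper built on the standard library can do the conversion. This is only a sketch: the function name and the cookie values are invented for illustration, and `SimpleCookie` may reject unusual cookie names.

```python
from http.cookies import SimpleCookie

def cookie_header_to_dict(header_value):
    """Turn a 'name=value; name2=value2' string into a plain dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}

# Hypothetical value copied from the browser's developer tools.
raw = 'user_session=abc123; logged_in=yes'
cookies = cookie_header_to_dict(raw)
print(cookies)
```

The resulting dict can then be passed as `cookies=cookies` to `requests.get()`.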

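III. POST requests

1. Introduction

 #GET requests

 GET is the default HTTP request method

      * Normally carries no request body

      * The data travels in the URL, so its size is bounded by the URL length limits of browsers and servers (HTTP itself sets no fixed limit)

      * The request data is visible in the browser's address bar

 Common situations that produce a GET request:

        1. Entering a URL directly in the browser's address bar always issues a GET request

        2. Clicking a hyperlink on a page also issues a GET request

        3. Submitting a form uses GET by default, but the form can be set to POST

 #POST requests

 (1). The data does not appear in the address bar

 (2). The protocol sets no upper limit on the data size (servers may impose their own)

 (3). There is a request body

 (4). Chinese characters in a form-encoded request body are URL-encoded

 #Note: requests.post() is used the same way as requests.get(); the difference is the data parameter, which holds the request body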
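The differences listed above can be observed directly by preparing one GET and one POST with the same payload and comparing them. A minimal offline sketch; the payload is made up and the httpbin URLs are only used as labels, nothing is sent.

```python
import requests

payload = {'name': '中文', 'page': 1}

# GET: the data is encoded into the URL and there is no body.
get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# POST: the URL stays clean and the data is form-encoded into the body.
post_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()

print(get_req.url)
print(get_req.body)
print(post_req.url)
print(post_req.body)
print(post_req.headers['Content-Type'])
```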

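2. Sending a POST request to simulate a browser login

#For a login flow, enter a wrong username or password and then analyze the captured traffic. With correct credentials the browser redirects immediately and the login request is very hard to find in the capture

 '''

 1. Analyzing the target site

     Open https://github.com/login in the browser

     Enter a wrong username and password, and capture the traffic

     The login turns out to be a POST to https://github.com/session

     The request headers include a cookie

     The request body contains:

         commit:Sign in

         utf8:✓

         authenticity_token:lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==

         login:egonlin

         password:123

 2. Analyzing the flow

     First GET https://github.com/login to obtain the initial cookie and the authenticity_token

     Then POST to https://github.com/session with the initial cookie and the request body (authenticity_token, username, password, etc.)

     Finally, obtain the logged-in cookie

     ps: if the password is sent in encrypted form, enter a wrong username with the correct password and copy the encrypted password from the browser. GitHub sends the password in plain text

 '''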

 import requests

 import re

 #First request

 r1=requests.get('https://github.com/login')

 r1_cookie=r1.cookies.get_dict() #initial cookie (not yet authorized)

 authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] #extract the CSRF token from the page

 #Second request: POST to the session URL with the initial cookie, the token, and the credentials

 data={

     'commit':'Sign in',

     'utf8':'✓',

     'authenticity_token':authenticity_token,

     'login':'317828332@qq.com',

     'password':'alex3714'

 }

 r2=requests.post('https://github.com/session',

              data=data,

              cookies=r1_cookie

              )

 login_cookie=r2.cookies.get_dict()

 #Third request: from now on login_cookie is enough, e.g. to visit personal settings

 r3=requests.get('https://github.com/settings/emails',

                 cookies=login_cookie)

 print('317828332@qq.com' in r3.text) #True

Logging in to GitHub automatically (handling the cookies yourself)
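The token extraction relies on a non-greedy regular expression, and whether it matches depends on how the form's HTML is laid out. The sketch below runs the same pattern against an invented form fragment (not GitHub's real markup) where the value attribute sits on its own line, to show why the re.S flag can matter.

```python
import re

# Hypothetical fragment of a login form, for illustration only.
sample_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token"
         value="TOKEN-FOR-ILLUSTRATION" />
  <input type="text" name="login" />
</form>
'''

pattern = r'name="authenticity_token".*?value="(.*?)"'

# Without re.S the dot does not cross the line break, so nothing matches here.
no_flag = re.findall(pattern, sample_html)
print(no_flag)

# With re.S the dot matches newlines too and the token is found.
tokens = re.findall(pattern, sample_html, re.S)
print(tokens)
```

Indexing the result with `[0]` raises IndexError when the list is empty, so checking the list first gives a clearer failure if the page layout changes.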

 import requests

 import re

 session=requests.session()

 #First request

 r1=session.get('https://github.com/login')

 authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text,re.S)[0] #extract the CSRF token from the page

 #Second request

 data={

     'commit':'Sign in',

     'utf8':'✓',

     'authenticity_token':authenticity_token,

     'login':'317828332@qq.com',

     'password':'alex3714'

 }

 r2=session.post('https://github.com/session',

              data=data,

              )

 #Third request

 r3=session.get('https://github.com/settings/emails')

 print('317828332@qq.com' in r3.text) #True

requests.session() saves the cookies for us automatically
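What the session does with stored cookies can be checked offline. In this sketch a cookie is placed in the session's jar by hand, standing in for one the server would set after login, and a later request is prepared (not sent) to see that the Cookie header is filled in. The cookie value is hypothetical.

```python
import requests

session = requests.session()

# Stand-in for a cookie the server would set after a successful login.
session.cookies.set('user_session', 'hypothetical-value')

# Prepare (without sending) a later request through the same session.
req = requests.Request('GET', 'https://github.com/settings/emails')
prepared = session.prepare_request(req)

# The session attached its stored cookie automatically.
print(prepared.headers.get('Cookie'))
```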

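3. Additional notes

 requests.post(url='xxxxxxxx',

               data={'xxx':'yyy'}) #no Content-Type specified; the default is application/x-www-form-urlencoded

 #If we set the Content-Type to application/json ourselves but still pass the values via data, the body is still form-encoded and the server cannot read the values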

 requests.post(url='',

               data={'':1,},

               headers={

                   'content-type':'application/json'

               })

 requests.post(url='',

               json={'':1,},

               ) #the Content-Type defaults to application/json

With a JSON Content-Type, values passed via data do not get through
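The mismatch is easy to see by preparing the three variants and printing the header next to the body that would actually be sent. A minimal offline sketch with a made-up payload; the httpbin URL is only a placeholder.

```python
import json
import requests

url = 'http://httpbin.org/post'
payload = {'key': 1}

form = requests.Request('POST', url, data=payload).prepare()
as_json = requests.Request('POST', url, json=payload).prepare()

# Header says JSON, but data= still produces a form-encoded body.
mismatch = requests.Request(
    'POST', url, data=payload,
    headers={'content-type': 'application/json'},
).prepare()

for name, req in [('data', form), ('json', as_json), ('mismatch', mismatch)]:
    print(name, req.headers['Content-Type'], req.body)
```

In the third case a server that trusts the header would try to parse `key=1` as JSON and fail.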

 :param method: method for the new :class:`Request` object.

     :param url: URL for the new :class:`Request` object.

     :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.

     :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.

     :param json: (optional) json data to send in the body of the :class:`Request`.

     :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.

     :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.

     :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.

         ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``

         or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string

         defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers

         to add for the file.

     :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.

     :param timeout: (optional) How many seconds to wait for the server to send data

         before giving up, as a float, or a :ref:`(connect timeout, read

         timeout) <timeouts>` tuple.

     :type timeout: float or tuple

     :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.

     :type allow_redirects: bool

     :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.

     :param verify: (optional) Either a boolean, in which case it controls whether we verify

             the server's TLS certificate, or a string, in which case it must be a path

             to a CA bundle to use. Defaults to ``True``.

     :param stream: (optional) if ``False``, the response content will be immediately downloaded.

     :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.

What can be passed, according to part of the source

IV. The Response object

1. Response attributes

import requests

respone=requests.get('http://www.jianshu.com')

# Response attributes

print(respone.text)#body decoded to text

print(respone.content)#body as raw bytes (e.g. media)

print(respone.status_code)#status code; a 200 alone does not guarantee the content is what you wanted

print(respone.headers)#response headers

print(respone.cookies)#cookies set by the response

print(respone.cookies.get_dict())#converted to a dict

print(respone.cookies.items())

print(respone.url)

print(respone.history)

print(respone.encoding)

#Closing: response.close()

from contextlib import closing

with closing(requests.get('xxx',stream=True)) as response:

    for line in response.iter_content():

        pass

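2. Encoding issues

#Encoding issues

import requests

response=requests.get('http://www.autohome.com/news')

# response.encoding='gbk' #The autohome page is encoded as gb2312, while requests falls back to ISO-8859-1 when the response headers declare no charset; without setting gbk the Chinese text comes out garbled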

print(response.text)
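The garbling itself can be reproduced without any network access, since `response.text` is just `response.content` decoded with `response.encoding`. The sketch below uses a short made-up string in place of the page body.

```python
# Bytes as a GBK-encoded page would deliver them.
raw = '汽车之家'.encode('gbk')

# Decoding with the wrong charset silently produces garbage...
garbled = raw.decode('iso-8859-1')

# ...while the right charset restores the text.
correct = raw.decode('gbk')

print(repr(garbled))
print(correct)
```

No exception is raised in the wrong case, which is why the problem only shows up as unreadable output. When the charset is unknown, `response.apparent_encoding` offers a guess based on the body.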

3. Fetching binary data

import requests

response=requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')

with open('a.jpg','wb') as f:

    f.write(response.content)

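 #The stream parameter: fetch the body piece by piece. When downloading a video of, say, 100 GB, reading response.content and writing it to a file in one go is not reasonable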

 import requests

 response=requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',

                       stream=True)

 with open('b.mp4','wb') as f:

     for chunk in response.iter_content(chunk_size=8192): #read in 8 KB chunks; the default chunk_size is 1 byte

         f.write(chunk)

Fetching a binary stream
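The write loop does not depend on requests at all: it only needs an iterable of byte chunks. The sketch below factors it into a helper (the name `save_chunks` is made up) and feeds it a fake list of chunks standing in for `response.iter_content(chunk_size=8192)`.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path; return bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Stand-in for response.iter_content(chunk_size=8192).
fake_chunks = [b'a' * 8192, b'b' * 8192, b'c' * 100]
target = os.path.join(tempfile.mkdtemp(), 'b.mp4')
total = save_chunks(fake_chunks, target)
print(total, os.path.getsize(target))
```

Only one chunk is held in memory at a time, which is the point of stream=True for large downloads.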

4. Parsing JSON

 #Parsing JSON

 import requests

 response=requests.get('http://httpbin.org/get')

 import json

 res1=json.loads(response.text) #too cumbersome

 res2=response.json() #get the JSON data directly

 print(res1 == res2) #True

Response can parse JSON directly

5. Redirection and History

 By default Requests will perform location redirection for all verbs except HEAD.

 We can use the history property of the Response object to track redirection.

 The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

 For example, GitHub redirects all HTTP requests to HTTPS:

 >>> r = requests.get('http://github.com')

 >>> r.url

 'https://github.com/'

 >>> r.status_code

 200

 >>> r.history

 [<Response [301]>]

 If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:

 >>> r = requests.get('http://github.com', allow_redirects=False)

 >>> r.status_code

 301

 >>> r.history

 []

 If you're using HEAD, you can enable redirection as well:

 >>> r = requests.head('http://github.com', allow_redirects=True)

 >>> r.url

 'https://github.com/'

 >>> r.history

 [<Response [301]>]

From the official documentation

 import requests

 import re

 #First request

 r1=requests.get('https://github.com/login')

 r1_cookie=r1.cookies.get_dict() #initial cookie (not yet authorized)

 authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] #extract the CSRF token from the page

 #Second request: POST to the session URL with the initial cookie, the token, and the credentials

 data={

     'commit':'Sign in',

     'utf8':'✓',

     'authenticity_token':authenticity_token,

     'login':'317828332@qq.com',

     'password':'alex3714'

 }

 #Test 1: without allow_redirects=False, a Location header in the response is followed to the new page, and r2 is the response of the new page

 r2=requests.post('https://github.com/session',

              data=data,

              cookies=r1_cookie

              )

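 print(r2.status_code) #

 print(r2.url) #the page after the redirect

 print(r2.history) #the responses from before the redirect

 print(r2.history[0].text) #the text of the response from before the redirect

 #Test 2: with allow_redirects=False, a Location header in the response is not followed, and r2 is still the response of the original page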

 r2=requests.post('https://github.com/session',

              data=data,

              cookies=r1_cookie,

              allow_redirects=False

              )

 print(r2.status_code) #

 print(r2.url) #the page before the redirect: https://github.com/session

 print(r2.history) #[]

Verified with the redirect to the home page after logging in to GitHub

V. Advanced usage

1. SSL Cert Verification

 #Certificate verification (most sites use HTTPS)

 import requests

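 respone=requests.get('https://www.12306.cn') #for an HTTPS request the certificate is checked first; if it is invalid an error is raised and the program stops

 #Improvement 1: suppress the error, though a warning is still emitted

 import requests

 respone=requests.get('https://www.12306.cn',verify=False) #certificate not verified; a warning is emitted and 200 is returned

 print(respone.status_code)

 #Improvement 2: suppress the error and silence the warning as well

 import requests

 from requests.packages import urllib3

 urllib3.disable_warnings() #turn off the warnings

 respone=requests.get('https://www.12306.cn',verify=False)

 print(respone.status_code)

 #Improvement 3: supply a client certificate

 #Many sites use HTTPS but can be accessed without a client certificate; in most cases sending one is optional

 #Zhihu, Baidu and the like work either way

 #Where it is a hard requirement you must send one, e.g. when only designated users who hold the certificate may access a particular site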

 import requests

 respone=requests.get('https://www.12306.cn',

                      cert=('/path/server.crt',

                            '/path/key'))

 print(respone.status_code)


2. Using proxies

 #Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies

 #Proxy setup: the request is sent to the proxy first, and the proxy forwards it (getting your IP banned is common)

 import requests

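 proxies={

     #'http':'http://egon:123@localhost:9743',#proxy with credentials: the username and password go before the @ (a dict cannot hold two 'http' keys, so only one form is active)

     'http':'http://localhost:9743',

     'https':'https://localhost:9743',

 }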

 respone=requests.get('https://www.12306.cn',

                      proxies=proxies)

 print(respone.status_code)

 #SOCKS proxies are supported; install with: pip install requests[socks]

 import requests

 proxies = {

     'http': 'socks5://user:pass@host:port',

     'https': 'socks5://user:pass@host:port'

 }

 respone=requests.get('https://www.12306.cn',

                      proxies=proxies)

 print(respone.status_code)
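When the proxy needs credentials, characters such as `@` or `:` in the username or password would break the `user:pass@host:port` form unless they are percent-encoded. A small helper can build the URL safely. This is a sketch: the function name, the credentials, and the choice of an http:// proxy for both schemes are assumptions for illustration.

```python
from urllib.parse import quote

def proxy_url(scheme, host, port, user=None, password=None):
    """Build a proxy URL, percent-encoding any credentials."""
    auth = ''
    if user is not None:
        auth = quote(user, safe='')
        if password is not None:
            auth += ':' + quote(password, safe='')
        auth += '@'
    return f'{scheme}://{auth}{host}:{port}'

# Hypothetical credentials; the '@' in the password must not end the userinfo part.
proxies = {
    'http': proxy_url('http', 'localhost', 9743, 'egon', 'p@ss'),
    'https': proxy_url('http', 'localhost', 9743, 'egon', 'p@ss'),
}
print(proxies)
```

The keys of the dict are the scheme of the target URL, so each scheme appears once.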

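3. Timeouts

 #Timeouts

 #Two forms: float or tuple

 #timeout=0.1 #a single value applies to both the connect timeout and the read timeout

 #timeout=(0.1,0.2)#0.1 is the connect timeout, 0.2 is the read timeout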

 import requests

 respone=requests.get('https://www.baidu.com',

                      timeout=0.0001)

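4. Authentication

 #Official docs: http://docs.python-requests.org/en/master/user/authentication/

 #Authentication: some sites pop up a dialog asking for a username and password when you visit them (much like an alert box); until you authenticate you cannot get the HTML

 # Under the hood the credentials are assembled into a request header

 #         r.headers['Authorization'] = _basic_auth_str(self.username, self.password)

 # Most sites do not use this default scheme and implement their own instead

 # In that case we have to write our own counterpart of _basic_auth_str that follows the site's scheme

 # and add the resulting string to the request headers

 #         r.headers['Authorization'] =func('.....')

 #Here is the default scheme anyway; sites usually do not rely on it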

 import requests

 from requests.auth import HTTPBasicAuth

 r=requests.get('url',auth=HTTPBasicAuth('user','password'))

 print(r.status_code)

 #HTTPBasicAuth can be abbreviated as follows

 import requests

 r=requests.get('url',auth=('user','password'))

 print(r.status_code)

 # Typically used for internal sites
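The Authorization header for Basic auth is simply `username:password` run through Base64. The sketch below rebuilds that value with the standard library for ASCII credentials; `basic_auth_header` is a made-up name and this is an approximation of what requests produces, not its actual implementation.

```python
import base64

def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f'{username}:{password}'.encode('latin1'))
    return 'Basic ' + token.decode('ascii')

header = basic_auth_header('user', 'password')
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it.
decoded = base64.b64decode(header.split(' ', 1)[1]).decode('latin1')
print(decoded)
```

Because the value is trivially reversible, Basic auth is only reasonable over HTTPS.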

5. Exception handling

 #Exception handling

 import requests

 from requests.exceptions import * #see requests.exceptions for the available exception types

 try:

     r=requests.get('http://www.baidu.com',timeout=0.00001)

 except ReadTimeout:

     print('===:')

 # except ConnectionError: #network unreachable

 #     print('-----')

 # except Timeout:

 #     print('aaaaa')

 except RequestException:

     print('Error')

Exception handling
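The order of the except clauses matters because the exception classes form a hierarchy: a broad class listed first would swallow the specific ones. The sketch below classifies exception instances from most to least specific without making any request; `describe` is a made-up helper. With a very small timeout the failure may well be a ConnectTimeout rather than a ReadTimeout, which is one reason to keep the RequestException fallback.

```python
from requests.exceptions import (
    ConnectionError, ConnectTimeout, ReadTimeout, RequestException, Timeout,
)

def describe(exc):
    """Map a requests exception to a short label, most specific first."""
    if isinstance(exc, ConnectTimeout):
        return 'connect timeout'
    if isinstance(exc, ReadTimeout):
        return 'read timeout'
    if isinstance(exc, ConnectionError):
        return 'connection error'
    if isinstance(exc, RequestException):
        return 'other request error'
    return 'not a requests error'

for exc in (ConnectTimeout(), ReadTimeout(), ConnectionError(), RequestException()):
    print(type(exc).__name__, '->', describe(exc))

# ConnectTimeout is both a ConnectionError and a Timeout.
print(issubclass(ConnectTimeout, ConnectionError), issubclass(ConnectTimeout, Timeout))
```

Note that importing ConnectionError from requests.exceptions shadows Python's built-in class of the same name within the module.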

6. Uploading files

 import requests

 files={'file':open('a.jpg','rb')}

 respone=requests.post('http://httpbin.org/post',files=files)

 print(respone.status_code)
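Passing files= makes requests build a multipart/form-data body. What that body looks like can be inspected offline by preparing the request with an in-memory file. A sketch only: the file content is fake, and the 3-tuple form `(filename, fileobj, content_type)` is the one described in the parameter list quoted earlier.

```python
import io
import requests

# In-memory stand-in for open('a.jpg', 'rb'); the bytes are made up.
fake_file = io.BytesIO(b'not really a jpeg')
files = {'file': ('a.jpg', fake_file, 'image/jpeg')}

prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

content_type = prepared.headers['Content-Type']
print(content_type)  # multipart/form-data with a generated boundary
print(b'filename="a.jpg"' in prepared.body)
```

With a real file, opening it in a `with` block ensures the handle is closed after the upload.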

7. requests official site

Chinese documentation
