Basic usage of the requests library for web scraping

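A few points to note:

1. The trailing s in the URL below is a virtual directory (a path on the server, not a file).

2. Do not append a question mark to the URL yourself; requests adds it automatically when it sends a request that carries parameters.

get_url = 'http://www.baidu.com/s'

3. A URL must consist only of ASCII characters, that is, characters found in the ASCII table.

4. The parameters carried by a request can be packed into a dictionary and passed to get (as params) or post (as data). params converts non-ASCII data to ASCII by percent-encoding it, in the same way as urllib.parse.urlencode.

5. Dictionary values should be strings. If a value comes from a variable of another type, such as an int, convert it with str() first (simple values such as integers are also stringified during encoding).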
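
The notes above can be checked without sending any request. The following is a minimal sketch using only the standard library; the query values are made up for illustration, and it shows what percent-encoding does to a parameter dictionary rather than what requests does internally.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://www.baidu.com/s"      # no trailing "?" needed
params = {"wd": "haiyan海燕", "pn": 1}    # illustrative query values

# urlencode percent-encodes non-ASCII text (as UTF-8 by default) and
# converts non-string values such as the integer 1 with str().
query = urlencode(params)
full_url = base_url + "?" + query

print(query)
print(full_url.isascii())  # the finished URL contains only ASCII characters

# Decoding the query string recovers the original values, all as strings.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

Note that the integer 1 comes back as the string "1": a query string carries no type information.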

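I. Introduction

# Introduction: requests lets you simulate a browser's requests. Compared with urllib, which we used earlier, the API of the requests module is much more convenient (under the hood it wraps urllib3).

# Note: after requests downloads a page, it does not execute any JavaScript. We have to analyze the target site ourselves and issue the additional requests.

# Install: pip3 install requests

# Request methods: the most commonly used are requests.get() and requests.post()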

>>> import requests

>>> r = requests.get('https://api.github.com/events')

>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})

>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})

>>> r = requests.delete('http://httpbin.org/delete')

>>> r = requests.head('http://httpbin.org/get')

>>> r = requests.options('http://httpbin.org/get')

# Before studying requests in earnest, it is worth getting familiar with the HTTP protocol first:

http://www.cnblogs.com/linhaifeng/p/6266327.html

II. GET requests

1. Basic request

import requests

response = requests.get('http://www.baidu.com/')
print(response.text)

2. GET request with parameters -> params

# Disguise ourselves as a browser in the request headers; otherwise Baidu will not return the normal page content
import requests

response = requests.get('https://www.baidu.com/s?wd=python&pn=1',
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
print(response.text)

# If the search keyword contains Chinese or other special characters, it has to be URL-encoded
from urllib.parse import urlencode

wb = "haiyan海燕"
encode_res = urlencode({"k": wb}, encoding="utf-8")
print(encode_res)  # k=haiyan%E6%B5%B7%E7%87%95
keywords = encode_res.split("=")[1]  # haiyan%E6%B5%B7%E7%87%95

# Then splice it into the URL
url = "https://www.baidu.com/s?wd=%s&pn=1" % keywords
# url = "https://www.baidu.com/s?" + encode_res
print(url)

response = requests.get(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
)
res1 = response.text  # kept for the comparison with the params version below

Building the GET parameters by hand

# The steps above can be handled by the params argument of requests, which URL-encodes the values for us
wd = 'haiyan海燕'  # same keyword as above, so that the two results are comparable
pn = 1

response = requests.get('https://www.baidu.com/s',
                        params={
                            'wd': wd,
                            'pn': pn,
                        },
                        headers={
                            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                        })
res2 = response.text

# Check the result: open a.html and b.html; both should show the results for the same query
with open('a.html', 'w', encoding='utf-8') as f:
    f.write(res1)

with open('b.html', 'w', encoding='utf-8') as f:
    f.write(res2)

Using the params argument
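
To see that params and hand-built query strings lead to the same URL, the request can be prepared without being sent. This is a small sketch, assuming only that requests is installed; the keyword is the one used in the example above.

```python
from urllib.parse import urlencode

import requests

params = {"wd": "haiyan海燕", "pn": 1}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request(
    "GET", "https://www.baidu.com/s", params=params
).prepare()

# The same URL assembled by hand with urlencode.
manual_url = "https://www.baidu.com/s?" + urlencode(params)

print(prepared.url)
print(prepared.url == manual_url)
```

Preparing a request this way is also a convenient debugging aid whenever a server seems to receive different parameters from the ones you intended.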

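3. GET request with parameters -> headers

# We usually need to send request headers, since they are the key to disguising ourselves as a browser. Commonly useful headers:
Host
Referer     # large sites often use this to decide where a request came from
User-Agent  # identifies the client
Cookie      # cookies travel in the request headers, but requests has a separate cookies argument for them, so there is no need to put them in headers={}

# Adding headers (the server inspects the request headers and may refuse access without them, for example https://www.zhihu.com/explore)
import requests

response = requests.get('https://www.zhihu.com/explore')
print(response.status_code)  # 500

# Custom headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response = requests.get('https://www.zhihu.com/explore',
                        headers=headers)
print(response.status_code)  # 200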
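
Why a site may reject a bare request becomes clearer when you look at what requests sends by default. The sketch below prepares a request offline through a Session; the User-Agent string is a made-up placeholder, not a real browser signature.

```python
import requests

custom_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) example-browser",  # placeholder value
    "Referer": "https://www.zhihu.com/",
}

session = requests.Session()
default_agent = session.headers["User-Agent"]  # what is sent when nothing is customized

# Prepare (but do not send) a request; custom headers override the session defaults.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.zhihu.com/explore", headers=custom_headers)
)

print(default_agent)
print(prepared.headers["User-Agent"])
print(prepared.headers["Referer"])
```

The default User-Agent announces the client as python-requests, which is easy for a server to filter; supplying a browser-like value replaces it while the other default headers are kept.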

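4. GET request with parameters -> cookies

# Log in to GitHub, then copy the cookies from the browser; after that you can log in with the cookie directly, without entering the username and password
# Username: egonlin, email: 378533872@qq.com, password: lhf@123
import requests

cookies = {
    'user_session': 'wGMHFJKgDcmRIVvcA14_Wrt_3xaUyJNsBnPbYzEL6L0bHcfc',
}
response = requests.get('https://github.com/settings/emails',
                        cookies=cookies)  # GitHub places few restrictions on request headers, so no custom User-Agent is needed; other sites may require one
print('378533872@qq.com' in response.text)  # True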
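
The cookies argument is a convenience: in the end the values travel in an ordinary Cookie request header. A minimal offline sketch, with a placeholder cookie value:

```python
import requests

cookies = {"user_session": "example-session-value"}  # placeholder, not a real session

# Prepare the request without sending it and inspect the resulting header.
prepared = requests.Request(
    "GET", "https://github.com/settings/emails", cookies=cookies
).prepare()

cookie_header = prepared.headers["Cookie"]
print(cookie_header)
```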

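III. POST requests

1. Introduction

# GET requests
GET is the default HTTP request method.
     * No request body
     * The amount of data is limited: HTTP itself sets no limit, but browsers and servers cap the URL length, often at a few kilobytes
     * The request data is exposed in the browser's address bar

Common ways of issuing a GET request:
       1. Entering a URL directly in the browser's address bar is always a GET request
       2. Clicking a hyperlink on a page is also always a GET request
       3. Submitting a form uses GET by default, but the form can be set to POST

# POST requests
(1). The data does not appear in the address bar
(2). The protocol puts no upper limit on the size of the data (servers may impose their own)
(3). There is a request body
(4). Chinese characters in a form-encoded request body are URL-encoded

# !!! requests.post() is used in the same way as requests.get(); the difference is that requests.post() is normally given a data argument that holds the request body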
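
The difference between the two methods can be made visible by preparing both requests offline. This is a sketch with an illustrative payload; the httpbin.org URLs are the ones used earlier in this article and are never contacted here.

```python
import requests

payload = {"key": "value", "name": "海燕"}

get_req = requests.Request("GET", "http://httpbin.org/get", params=payload).prepare()
post_req = requests.Request("POST", "http://httpbin.org/post", data=payload).prepare()

print(get_req.url)    # the parameters are part of the URL
print(get_req.body)   # a GET built this way has no body
print(post_req.url)   # the URL stays clean
print(post_req.body)  # the parameters travel in the body, URL-encoded
print(post_req.headers["Content-Type"])
```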

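2. Sending a POST request to simulate a browser login

Notes:
  1. When analyzing a login, enter a wrong username or password and then study the captured traffic. With correct credentials the browser redirects immediately, and the login request becomes very hard to find in the capture.
  2. Always clear the cookies before working on a login.
  3. requests.session(): there is no need to analyze the intermediate cookies yourself; the session stores all of them, useful or not.
  4. response.cookies.get_dict()  # get the cookies as a dict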

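'''
I. Analyzing the target site
    Open https://github.com/login in the browser
    Enter a wrong username and password, and capture the traffic
    The login turns out to be a POST to https://github.com/session
    The request headers contain a cookie
    The request body contains:
        commit:Sign in
        utf8:✓
        authenticity_token:lbI8IJCwGslZS8qJPnof5e7ZkCoSoMn6jmDTsL1r/m06NLyIbw7vCrpwrFAPzHMep3Tmf/TSJVoXWrvDZaVwxQ==
        login:egonlin
        password:123

II. Flow
    First GET https://github.com/login to obtain the initial cookie and the authenticity_token
    Then POST to https://github.com/session with the initial cookie and the request body (authenticity_token, username, password, and so on)
    Finally, obtain the logged-in cookie

    ps: if the password is sent in encrypted form, enter a wrong username with the correct password and read the encrypted password from the browser capture. GitHub sends the password in plain text.
'''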

import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # the initial cookie (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # take the CSRF token from the page

# Second request: POST to the login endpoint with the initial cookie, the token, and the account credentials
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714',
}
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie)
login_cookie = r2.cookies.get_dict()

# Third request: from now on login_cookie is enough, for example to visit a personal settings page
r3 = requests.get('https://github.com/settings/emails',
                  cookies=login_cookie)
print('317828332@qq.com' in r3.text)  # True

Automatic GitHub login (handling the cookies ourselves)
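
The token extraction step is the most fragile part of this flow, since re.findall(...)[0] raises IndexError when the page layout changes. Below is a small standalone sketch of a more defensive helper; the function name extract_token and the HTML fragment are illustrative assumptions, not GitHub's actual markup.

```python
import re

# re.S lets ".*?" cross line breaks between the name and value attributes.
TOKEN_PATTERN = re.compile(r'name="authenticity_token".*?value="(.*?)"', re.S)


def extract_token(html):
    """Return the first authenticity_token found in html, or None."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up login-page fragment, for illustration only.
sample_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token"
         value="abc123==" />
  <input type="text" name="login" />
</form>
'''

token = extract_token(sample_html)
missing = extract_token("<html>no form here</html>")
print(token)
print(missing)
```

Returning None lets the caller decide whether to retry, log the page, or abort, instead of failing with an unexplained IndexError.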

import requests
import re

session = requests.session()

# First request
r1 = session.get('https://github.com/login')
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # take the CSRF token from the page

# Second request
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714',
}
r2 = session.post('https://github.com/session',
                  data=data)

# Third request
r3 = session.get('https://github.com/settings/emails')
print('317828332@qq.com' in r3.text)  # True

requests.session() saves the cookie information for us automatically
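
What the session does with cookies can be observed offline. In this sketch a cookie is placed in the session's jar by hand, standing in for one that an earlier response would have set; the cookie value is a placeholder.

```python
import requests

session = requests.Session()

# Simulate a cookie that an earlier response would have stored in the session.
session.cookies.set("user_session", "example-session-value")  # placeholder value

# Any later request prepared through the session carries the cookie automatically.
prepared = session.prepare_request(
    requests.Request("GET", "https://github.com/settings/emails")
)

print(session.cookies.get_dict())
print(prepared.headers.get("Cookie"))
```

A cookie set this way has no domain restriction, so the session sends it to every host; cookies received from real responses are scoped to the domain that set them.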

A small GitHub login application

import requests
import re

# First request
    # GET request
    # Request headers
    #    - User-Agent
    # Obtain the token and the cookie
# Second request
    # POST request
    # Request headers
        # Referer
        # User-Agent
    # Request body
        # the form data
# Third request, after a successful login
    # - Log in manually once beforehand and check whether a Referer is sent
    # - Request a new URL and carry out other operations
    # - Check whether the username appears in the page

# First request
response1 = requests.get(
    "https://github.com/login",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', response1.text, re.S)[0]  # findall returns a list; take the first match
r1_cookies = response1.cookies.get_dict()
# print(r1_cookies, "cookie")  # the cookies obtained

# Second request
response2 = requests.post(
    "https://github.com/session",
    headers={
        "Referer": "https://github.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
    data={
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": authenticity_token,
        "login": "haiyanzzz",
        "password": "xxxx",
    },
    cookies=r1_cookies,
)
print(response2.status_code)
print(response2.history)  # the responses of the redirects that were followed

# Third request: after logging in, visit another page
r2_cookies = response2.cookies.get_dict()  # carry the cookie so the server knows it is you
response3 = requests.get(
    "https://github.com/settings/emails",
    headers={
        "Referer": "https://github.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36",
    },
    cookies=r2_cookies,
)
print(response3.text)
print("haiyanzzz" in response3.text)  # True means the login succeeded

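3. Supplement

requests.post(url='xxxxxxxx',
              data={'xxx': 'yyy'})  # no Content-Type specified; the default is application/x-www-form-urlencoded

# If we set the Content-Type header to application/json ourselves but still pass the values with data, the body is still form-encoded, so a server expecting JSON cannot read the values
requests.post(url='',
              data={'': 1},
              headers={
                  'content-type': 'application/json'
              })

requests.post(url='',
              json={'': 1},
              )  # the body is serialized as JSON and the Content-Type defaults to application/json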
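
The three cases can be compared side by side by preparing the requests without sending them. This is a sketch with an illustrative payload; the URL is the httpbin endpoint used earlier and is not contacted.

```python
import json

import requests

url = "http://httpbin.org/post"
payload = {"name": "value", "count": 1}

# data= with no explicit header: form-encoded body, matching Content-Type.
form_req = requests.Request("POST", url, data=payload).prepare()

# data= plus a JSON header: the header says JSON, but the body is still a form.
mismatch_req = requests.Request(
    "POST", url, data=payload, headers={"Content-Type": "application/json"}
).prepare()

# json=: the body really is JSON and the header is set for us.
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"], form_req.body)
print(mismatch_req.headers["Content-Type"], mismatch_req.body)
print(json_req.headers["Content-Type"], json_req.body)
```

The second case is the trap described above: the header and the body disagree, so a server that trusts the header tries to parse a form string as JSON.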

IV. The Response object

1. Response attributes

import requests

response = requests.get('http://www.jianshu.com')

# Response attributes
print(response.text)
print(response.content)
print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.cookies.get_dict())
print(response.cookies.items())
print(response.url)
print(response.history)
print(response.encoding)

# Closing: response.close()
from contextlib import closing

with closing(requests.get('xxx', stream=True)) as response:
    for line in response.iter_content():
        pass

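2. Encoding

# Encoding
import requests

response = requests.get('http://www.autohome.com/news')
# response.encoding = 'gbk'  # The autohome pages are encoded as gb2312, while requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset; without setting gbk the Chinese text comes out garbled
print(response.text)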
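
The garbling itself is ordinary byte decoding and can be reproduced without any request. A minimal sketch using only built-in string methods; the sample text is arbitrary.

```python
raw = "汽车之家".encode("gbk")      # bytes as a gb2312/gbk server would send them

wrong = raw.decode("ISO-8859-1")   # what a wrong fallback encoding produces
right = raw.decode("gbk")          # what setting response.encoding = 'gbk' achieves

print(repr(wrong))
print(right)
```

ISO-8859-1 maps every byte to some character, so decoding never fails; it silently yields eight meaningless characters for these four Chinese ones. When the correct encoding is unknown, response.apparent_encoding offers a guess based on the content.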

3. Fetching binary data

import requests

response = requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')

with open('a.jpg', 'wb') as f:
    f.write(response.content)

The stream parameter: fetch the data a piece at a time. For example, when downloading a 100 GB video, reading response.content and writing it to the file in one go is not reasonable.
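
A chunked download with iter_content might look like the sketch below. The helper name save_stream is an assumption made for this example, and the response object is built by hand so that the code runs without network access; real usage is shown in the comment.

```python
import io
import os
import tempfile

import requests


def save_stream(response, path, chunk_size=8192):
    """Write the response body to path chunk by chunk; return the bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Real usage would look like this (the URL is a placeholder):
#   with requests.get("https://example.com/video.mp4", stream=True) as r:
#       save_stream(r, "video.mp4")

# Offline stand-in for a streamed response, for demonstration only.
fake = requests.Response()
fake.status_code = 200
fake.raw = io.BytesIO(b"x" * 20000)

target = os.path.join(tempfile.mkdtemp(), "download.bin")
total = save_stream(fake, target, chunk_size=4096)
print(total, os.path.getsize(target))
```

Only one chunk is held in memory at a time, so memory use stays flat however large the file is.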


4. Parsing JSON

# Parsing JSON
import json

import requests

response = requests.get('http://httpbin.org/get')

res1 = json.loads(response.text)  # cumbersome
res2 = response.json()  # get the JSON data directly
print(res1 == res2)  # True

5. Redirection and History

 By default Requests will perform location redirection for all verbs except HEAD.

 We can use the history property of the Response object to track redirection.

 The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.

 For example, GitHub redirects all HTTP requests to HTTPS:

 >>> r = requests.get('http://github.com')

 >>> r.url

 'https://github.com/'

 >>> r.status_code

 200

 >>> r.history

 [<Response [301]>]

 If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:

 >>> r = requests.get('http://github.com', allow_redirects=False)

 >>> r.status_code

 301

 >>> r.history

 []

 If you're using HEAD, you can enable redirection as well:

 >>> r = requests.head('http://github.com', allow_redirects=True)

 >>> r.url

 'https://github.com/'

 >>> r.history

 [<Response [301]>]

 First, the explanation from the official documentation

import requests
import re

# First request
r1 = requests.get('https://github.com/login')
r1_cookie = r1.cookies.get_dict()  # the initial cookie (not yet authorized)
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', r1.text)[0]  # take the CSRF token from the page

# Second request: POST to the login endpoint with the initial cookie, the token, and the account credentials
data = {
    'commit': 'Sign in',
    'utf8': '✓',
    'authenticity_token': authenticity_token,
    'login': '317828332@qq.com',
    'password': 'alex3714',
}

# Test 1: allow_redirects=False is not given, so a Location response header is followed to the new page, and r2 is the response of the new page
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie)
print(r2.status_code)  # 200
print(r2.url)  # the page after the redirect
print(r2.history)  # the responses before the redirect
print(r2.history[0].text)  # the text of the response before the redirect

# Test 2: with allow_redirects=False, a Location response header is not followed, and r2 is still the response of the original page
r2 = requests.post('https://github.com/session',
                   data=data,
                   cookies=r1_cookie,
                   allow_redirects=False)
print(r2.status_code)  # 302
print(r2.url)  # the page before the redirect: https://github.com/session
print(r2.history)  # []

Verifying it with the redirect to the home page after a GitHub login
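
The same behaviour can be reproduced without a GitHub account by pointing requests at a tiny local server. This is a sketch under stated assumptions: RedirectHandler and its /old and /new paths are invented for the demonstration, and the server uses only the standard library.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Illustrative test server: /old redirects to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"arrived"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

followed = session.get(base + "/old", timeout=5)
not_followed = session.get(base + "/old", allow_redirects=False, timeout=5)

print(followed.status_code, followed.url, followed.history)
print(not_followed.status_code, not_followed.headers["Location"], not_followed.history)

server.shutdown()
```

With redirects followed, the final response is the 200 from /new and the 302 is kept in history; with allow_redirects=False the 302 itself is returned and history stays empty.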

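V. Advanced usage

1. SSL Cert Verification

# Certificate verification (most sites use https)
import requests

response = requests.get('https://www.12306.cn')  # for an SSL request the certificate is checked first; if it is invalid, an error is raised and the program stops

# Improvement 1: suppress the error, although a warning is still shown
import requests

response = requests.get('https://www.12306.cn', verify=False)  # the certificate is not verified; a warning is shown and 200 is returned
print(response.status_code)

# Improvement 2: suppress both the error and the warning
import requests
from requests.packages import urllib3

urllib3.disable_warnings()  # turn off the warnings
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Improvement 3: supply a certificate
# Many sites use https but can be accessed without a client certificate; in most cases supplying one is optional
# Zhihu, Baidu and the like work either way
# Where it is strictly required you must supply one, for example when only designated users who hold the certificate may access a particular site
import requests

response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))
print(response.status_code)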

2. Using proxies

# Official docs: http://docs.python-requests.org/en/master/user/advanced/#proxies
# Proxy setup: the request is sent to the proxy first, and the proxy forwards it (having your IP banned is common)
import requests

proxies = {
    # A dict can hold only one value per key, so keep just one 'http' entry active
    # 'http': 'http://egon:123@localhost:9743',  # proxy with credentials: the username and password come before the @
    'http': 'http://localhost:9743',
    'https': 'https://localhost:9743',
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)
print(response.status_code)

# SOCKS proxies are supported too; install with: pip install requests[socks]
import requests

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)
print(response.status_code)
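
Two details of the proxies dictionary are easy to get wrong: a repeated key silently overwrites the earlier entry, and the proxy is chosen by the scheme of the target URL. The sketch below checks both offline; it relies on the select_proxy helper from requests.utils, and the proxy addresses are the placeholders from the example above.

```python
from requests.utils import select_proxy

# A dict literal keeps only the last value for a repeated key.
duplicated = {
    "http": "http://egon:123@localhost:9743",
    "http": "http://localhost:9743",
}
print(duplicated)

proxies = {
    "http": "http://localhost:9743",
    "https": "https://localhost:9743",
}

# The entry is looked up by the scheme of the URL being requested.
chosen_http = select_proxy("http://www.baidu.com/", proxies)
chosen_https = select_proxy("https://www.12306.cn/", proxies)
print(chosen_http)
print(chosen_https)
```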

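3. Timeouts

# Timeout settings
# Two forms: float or tuple
# timeout=0.1         # a single value applies to both the connect timeout and the read timeout
# timeout=(0.1, 0.2)  # 0.1 is the connect timeout, 0.2 is the read (receive) timeout
import requests

response = requests.get('https://www.baidu.com',
                        timeout=0.0001)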

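4. Authentication

# Official docs: http://docs.python-requests.org/en/master/user/authentication/

# Authentication: when you visit some sites, a dialog pops up asking for a username and password (much like an alert box); until you authenticate, the HTML cannot be fetched
# Under the hood, the credentials are assembled into a request header and sent:
#         r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
# Most sites do not use this default scheme and implement their own
# In that case we have to write a function similar to _basic_auth_str that follows the site's scheme,
# and add the resulting string to the request headers:
#         r.headers['Authorization'] = func('.....')

# Here is the default scheme, which sites usually do not rely on
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('xxx', auth=HTTPBasicAuth('user', 'password'))
print(r.status_code)

# HTTPBasicAuth can be abbreviated as follows
import requests

r = requests.get('xxx', auth=('user', 'password'))
print(r.status_code)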
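
The header that basic authentication produces is simple enough to rebuild by hand, which makes it a useful model before writing a custom scheme. A minimal offline sketch with placeholder credentials; the httpbin URL is only used to prepare the request and is not contacted.

```python
import base64

import requests

user, password = "user", "password"  # placeholder credentials

# Basic auth is "user:password" encoded with base64; it is an encoding, not encryption.
manual = "Basic " + base64.b64encode(
    f"{user}:{password}".encode("latin1")
).decode("ascii")

prepared = requests.Request(
    "GET", "http://httpbin.org/basic-auth/user/password", auth=(user, password)
).prepare()

print(manual)
print(prepared.headers["Authorization"] == manual)
```

Because base64 is trivially reversible, basic authentication protects the password only when the connection itself is encrypted with https.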

5. Exception handling

# Exception handling
import requests
from requests.exceptions import ConnectionError, ReadTimeout, RequestException, Timeout  # see requests.exceptions for all exception types

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:
    print('===:')
# except ConnectionError:  # the network is unreachable
#     print('-----')
# except Timeout:
#     print('aaaaa')
except RequestException:
    print('Error')
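
The order of the except clauses matters, because the exception classes form a hierarchy. The sketch below exercises the common cases against a local test server; SlowHandler, the fetch helper, and the /slow and /fast paths are all assumptions made for this demonstration.

```python
import socket
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import requests
from requests.exceptions import ConnectionError, RequestException, Timeout


class SlowHandler(BaseHTTPRequestHandler):
    """Illustrative test server: /slow answers only after a delay."""

    def do_GET(self):
        if self.path == "/slow":
            time.sleep(1.0)
        body = b"ok"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client gave up waiting

    def log_message(self, *args):
        pass


def fetch(session, url, timeout):
    """Return a short label describing how the request ended."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
    except Timeout:            # covers ConnectTimeout and ReadTimeout
        return "timeout"
    except ConnectionError:    # refused connection, DNS failure, ...
        return "connection error"
    except RequestException:   # any other requests error, e.g. HTTPError
        return "request error"
    return "ok"


server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# Find a port with nothing listening on it, to provoke a refused connection.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

session = requests.Session()
session.trust_env = False  # ignore proxy environment variables for this local demo

slow_result = fetch(session, base + "/slow", timeout=(2, 0.2))  # (connect, read)
fast_result = fetch(session, base + "/fast", timeout=(2, 2))
refused_result = fetch(session, "http://127.0.0.1:%d/" % closed_port, timeout=(2, 2))
print(slow_result, fast_result, refused_result)

server.shutdown()
```

Timeout is caught before ConnectionError because a connect timeout belongs to both families; RequestException comes last as the base class of everything requests raises.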

6. Uploading files

import requests

with open('a.jpg', 'rb') as f:  # the file is closed once the upload is done
    files = {'file': f}
    response = requests.post('http://httpbin.org/post', files=files)
print(response.status_code)
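
Passing files makes requests build a multipart/form-data body. The sketch below prepares such a request offline so the result can be inspected; the in-memory file, its name, and the extra form field are illustrative stand-ins.

```python
import io

import requests

fake_file = io.BytesIO(b"hello upload")  # stands in for open('a.jpg', 'rb')

prepared = requests.Request(
    "POST",
    "http://httpbin.org/post",
    # (filename, file object, content type) gives control over the part's metadata
    files={"file": ("a.txt", fake_file, "text/plain")},
    data={"note": "demo"},  # ordinary form fields can be sent alongside
).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])
print(b'filename="a.txt"' in prepared.body)
print(b"hello upload" in prepared.body)
```

The Content-Type header also carries a boundary string that separates the parts, which is why it should not be set by hand when uploading files.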
