爬虫基础库之requests
requests
Python标准库中提供了:urllib、urllib2、httplib等模块以供Http请求,但是,它的 API 太渣了。它是为另一个时代、另一个互联网所创建的。它需要巨量的工作,甚至包括各种方法覆盖,来完成最简单的任务。
Requests 是使用 Apache2 Licensed 许可证的 基于Python开发的HTTP 库,其在Python内置模块的基础上进行了高度的封装,从而使得Pythoner进行网络请求时,变得美好了许多,使用Requests可以轻而易举的完成浏览器可有的任何操作。
1、GET请求
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
# 1、无参数实例 import requests ret = requests.get( 'https://github.com/timeline.json' ) print ret.url print ret.text # 2、有参数实例 import requests payload = { 'key1' : 'value1' , 'key2' : 'value2' } ret = requests.get( "http://httpbin.org/get" , params = payload) print ret.url print ret.text |
2、POST请求
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
|
# 1、基本POST实例 import requests payload = { 'key1' : 'value1' , 'key2' : 'value2' } ret = requests.post( "http://httpbin.org/post" , data = payload) print ret.text # 2、发送请求头和数据实例 import requests import json url = 'https://api.github.com/some/endpoint' payload = { 'some' : 'data' } headers = { 'content-type' : 'application/json' } ret = requests.post(url, data = json.dumps(payload), headers = headers) print ret.text print ret.cookies |
3、其他请求
1
2
3
4
5
6
7
8
9
10
|
requests.get(url, params = None , * * kwargs) requests.post(url, data = None , json = None , * * kwargs) requests.put(url, data = None , * * kwargs) requests.head(url, * * kwargs) requests.delete(url, * * kwargs) requests.patch(url, data = None , * * kwargs) requests.options(url, * * kwargs) # 以上方法均是在此方法的基础上构建 requests.request(method, url, * * kwargs) |
4、更多参数
def request(method, url, **kwargs):
"""Constructs and sends a :class:`Request <Request>`. :param method: method for the new :class:`Request` object.
:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
:param json: (optional) json data to send in the body of the :class:`Request`.
:param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
:param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
:param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
to add for the file.
:param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
:param timeout: (optional) How long to wait for the server to send data
before giving up, as a float, or a :ref:`(connect timeout, read
timeout) <timeouts>` tuple.
:type timeout: float or tuple
:param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
:type allow_redirects: bool
:param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
:param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
:param stream: (optional) if ``False``, the response content will be immediately downloaded.
:param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
:return: :class:`Response <Response>` object
:rtype: requests.Response Usage:: >>> import requests
>>> req = requests.request('GET', 'http://httpbin.org/get')
<Response [200]>
"""
def param_method_url():
# requests.request(method='get', url='http://127.0.0.1:8000/test/')
# requests.request(method='post', url='http://127.0.0.1:8000/test/')
pass def param_param():
# - 可以是字典
# - 可以是字符串
# - 可以是字节(ascii编码以内) # requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params={'k1': 'v1', 'k2': '水电费'}) # requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params="k1=v1&k2=水电费&k3=v3&k3=vv3") # requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8')) # 错误
# requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
pass def param_data():
# 可以是字典
# 可以是字符串
# 可以是字节
# 可以是文件对象 # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data={'k1': 'v1', 'k2': '水电费'}) # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data="k1=v1; k2=v2; k3=v3; k3=v4"
# ) # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data="k1=v1;k2=v2;k3=v3;k3=v4",
# headers={'Content-Type': 'application/x-www-form-urlencoded'}
# ) # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data=open('data_file.py', mode='r', encoding='utf-8'), # 文件内容是:k1=v1;k2=v2;k3=v3;k3=v4
# headers={'Content-Type': 'application/x-www-form-urlencoded'}
# )
pass def param_json():
# 将json中对应的数据进行序列化成一个字符串,json.dumps(...)
# 然后发送到服务器端的body中,并且Content-Type是 {'Content-Type': 'application/json'}
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
json={'k1': 'v1', 'k2': '水电费'}) def param_headers():
# 发送请求头到服务器端
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
json={'k1': 'v1', 'k2': '水电费'},
headers={'Content-Type': 'application/x-www-form-urlencoded'}
)
def param_cookies():
# 发送Cookie到服务器端
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
data={'k1': 'v1', 'k2': 'v2'},
cookies={'cook1': 'value1'},
)
# 也可以使用CookieJar(字典形式就是在此基础上封装)
from http.cookiejar import CookieJar
from http.cookiejar import Cookie obj = CookieJar()
obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
)
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
data={'k1': 'v1', 'k2': 'v2'},
cookies=obj) def param_files():
# 发送文件
# file_dict = {
# 'f1': open('readme', 'rb')
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) # 发送文件,定制文件名
# file_dict = {
# 'f1': ('test.txt', open('readme', 'rb'))
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) # 发送文件,定制文件名
# file_dict = {
# 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) # 发送文件,定制文件名
# file_dict = {
# 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) pass def param_auth():
from requests.auth import HTTPBasicAuth, HTTPDigestAuth ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text) # ret = requests.get('http://192.168.1.1',
# auth=HTTPBasicAuth('admin', 'admin'))
# ret.encoding = 'gbk'
# print(ret.text) # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
# print(ret)
# def param_timeout():
# ret = requests.get('http://google.com/', timeout=1)
# print(ret) # ret = requests.get('http://google.com/', timeout=(5, 1))
# print(ret)
pass def param_allow_redirects():
ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text)
def param_proxies():
# proxies = {
# "http": "61.172.249.96:80",
# "https": "http://61.185.219.126:3128",
# } # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'} # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
# print(ret.headers) # from requests.auth import HTTPProxyAuth
#
# proxyDict = {
# 'http': '77.75.105.165',
# 'https': '77.75.105.165'
# }
# auth = HTTPProxyAuth('username', 'mypassword')
#
# r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
# print(r.text) pass def param_stream():
ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
print(ret.content)
ret.close() # from contextlib import closing
# with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
# # 在此处理响应。
# for i in r.iter_content():
# print(i) def requests_session():
import requests session = requests.Session() ### 1、首先登陆任何页面,获取cookie i1 = session.get(url="http://dig.chouti.com/help/service") ### 2、用户登陆,携带上一次的cookie,后台对cookie中的 gpsd 进行授权
i2 = session.post(
url="http://dig.chouti.com/login",
data={
'phone': "8615131255089",
'password': "xxxxxx",
'oneMonth': ""
}
) i3 = session.post(
url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(i3.text)
} # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'} # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies) # print(ret.headers) # from requests.auth import HTTPProxyAuth # # proxyDict = { # 'http': '77.75.105.165', # 'https': '77.75.105.165' # } # auth = HTTPProxyAuth('username', 'mypassword') # # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth) # print(r.text) pass def param_stream(): ret = requests.get('http://127.0.0.1:8000/test/', stream=True) print(ret.content) ret.close() # from contextlib import closing # with closing(requests.get('http://httpbin.org/get', stream=True)) as r: # # 在此处理响应。 # for i in r.iter_content(): # print(i) def requests_session(): import requests session = requests.Session() ### 1、首先登陆任何页面,获取cookie i1 = session.get(url="http://dig.chouti.com/help/service") ### 2、用户登陆,携带上一次的cookie,后台对cookie中的 gpsd 进行授权 i2 = session.post( url="http://dig.chouti.com/login", data={ 'phone': "8615131255089", 'password': "xxxxxx", 'oneMonth': "" } ) i3 = session.post( url="http://dig.chouti.com/link/vote?linksId=8589623", ) print(i3.text)
', secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False, port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False) ) requests.request(method='POST', url='http://127.0.0.1:8000/test/', data={'k1': 'v1', 'k2': 'v2'}, cookies=obj) def param_files(): # 发送文件 # file_dict = { # 'f1': open('readme', 'rb') # } # requests.request(method='POST', # url='http://127.0.0.1:8000/test/', # files=file_dict) # 发送文件,定制文件名 # file_dict = { # 'f1': ('test.txt', open('readme', 'rb')) # } # requests.request(method='POST', # url='http://127.0.0.1:8000/test/', # files=file_dict) # 发送文件,定制文件名 # file_dict = { # 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf") # } # requests.request(method='POST', # url='http://127.0.0.1:8000/test/', # files=file_dict) # 发送文件,定制文件名 # file_dict = { # 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'}) # } # requests.request(method='POST', # url='http://127.0.0.1:8000/test/', # files=file_dict) pass def param_auth(): from requests.auth import HTTPBasicAuth, HTTPDigestAuth ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf')) print(ret.text) # ret = requests.get('http://192.168.1.1', # auth=HTTPBasicAuth('admin', 'admin')) # ret.encoding = 'gbk' # print(ret.text) # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass')) # print(ret) # def param_timeout(): # ret = requests.get('http://google.com/', timeout=1) # print(ret) # ret = requests.get('http://google.com/', timeout=(5, 1)) # print(ret) pass def param_allow_redirects(): ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False) print(ret.text) def param_proxies(): # proxies = { # "http": "61.172.249.96:80", # "https": "http://61.185.219.126:3128", # } # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'} # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies) # print(ret.headers) # from requests.auth import HTTPProxyAuth # # proxyDict = { # 'http': '77.75.105.165', # 'https': '77.75.105.165' # } # auth = HTTPProxyAuth('username', 'mypassword') # # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth) # print(r.text) pass def param_stream(): ret = requests.get('http://127.0.0.1:8000/test/', stream=True) print(ret.content) ret.close() # from contextlib import closing # with closing(requests.get('http://httpbin.org/get', stream=True)) as r: # # 在此处理响应。 # for i in r.iter_content(): # print(i) def requests_session(): import requests session = requests.Session() ### 1、首先登陆任何页面,获取cookie i1 = session.get(url="http://dig.chouti.com/help/service") ### 2、用户登陆,携带上一次的cookie,后台对cookie中的 gpsd 进行授权 i2 = session.post( url="http://dig.chouti.com/login", data={ 'phone': "8615131255089", 'password': "xxxxxx", 'oneMonth': "" } ) i3 = session.post( url="http://dig.chouti.com/link/vote?linksId=8589623", ) print(i3.text)
备注:requests请求参数中,files={‘f1’:open('xxx','rb')} ,可以上传文件 , auth参数用于基础认证
timeout参数用于设置超时时间(单位为秒) timeout=2,表示请求连接时,如果2秒没有连接上,放弃此次请求,timeout也可以设置两个参数 timeout=(3,2) 第一个时间表示请求连接的时间,第二个 参数表示响应的时间。
proxies参数用于设置代理ip。 proxies = { "http": "61.172.249.96:80", "https": "http://61.185.219.126:3128"} 表示如果请求是http格式的,就访问http对应的代理, 如果请求是https格式 的,就访问https对应的代理。
使用proxies的还有另一种方式 proxies = {'http://访问的网址': 'http://10.10.1.10:5323'} 表示访问某个网址,对应的使用哪个代理。
如果代理加密了可以使用这个方式:
from requests.auth import HTTPProxyAuth
proxyDict = { 'http': '77.75.105.165', 'https': '77.75.105.165' } # 代理ip
auth = HTTPProxyAuth('username', 'mypassword')
r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
cert参数表示使用证书(.pem文件)加密,如果有两个值时,表示给证书也加密, 搭配verify参数(用于确认证书)使用
另外,凡是以https开头的网址都是需要证书的
我们每次请求都会携带上次的cookie或者上几次的cookie。requests模块帮我们封装了一个session的方法,帮我们管理cookie和headers,帮助我们每次请求时,都会携带上一次的cookie,同时,响应回来后,会将新的cookie更新进去。
用法:session = requests.Session() session.get(....)
官方文档:http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4
爬虫基础库之requests的更多相关文章
- 爬虫基础库之requests模块
一.requests模块简介 使用requests可以模拟浏览器请求,比起之前用到的urllib,requests模块的api更加快捷,其实ruquests的本质就是封装urllib3这个模块. re ...
- [python爬虫]Requests-BeautifulSoup-Re库方案--Requests库介绍
[根据北京理工大学嵩天老师“Python网络爬虫与信息提取”慕课课程编写 文章中部分图片来自老师PPT 慕课链接:https://www.icourse163.org/learn/BIT-10018 ...
- F#之旅5 - 小实践之下载网页(爬虫基础库)
参考文章:https://swlaschin.gitbooks.io/fsharpforfunandprofit/content/posts/fvsc-download.html 参考的文章教了我们如 ...
- python爬虫之路——初识爬虫三大库,requests,lxml,beautiful.
三大库:requests,lxml,beautifulSoup. Request库作用:请求网站获取网页数据. get()的基本使用方法 #导入库 import requests #向网站发送请求,获 ...
- 爬虫——请求库之requests
阅读目录 一 介绍 二 基于GET请求 三 基于POST请求 四 响应Response 五 高级用法 一 介绍 #介绍:使用requests可以模拟浏览器的请求,比起之前用到的urllib,reque ...
- 爬虫基础库之Selenium
1.简介 selenium最初是一个自动化测试工具,而爬虫中使用它主要是为了解决requests无法直接执行JavaScript代码的问题 selenium本质是通过驱动浏览器,完全模拟浏览器的操作, ...
- 爬虫 - 请求库之requests
介绍 使用requests可以模拟浏览器的请求,比起python内置的urllib模块,requests模块的api更加便捷(本质就是封装了urllib3) 注意:requests库发送请求将网页内容 ...
- 爬虫请求库之requests库
一.介绍 介绍:使用requests可以模拟浏览器的请求,比之前的urllib库使用更加方便 注意:requests库发送请求将网页内容下载下来之后,并不会执行js代码,这需要我们自己分析目标站点然后 ...
- 爬虫基础库之beautifulsoup的简单使用
beautifulsoup的简单使用 简单来说,Beautiful Soup是python的一个库,最主要的功能是从网页抓取数据.官方解释如下: ''' Beautiful Soup提供一些简单的.p ...
随机推荐
- HDU 2136 素数打表+求质数因子
Largest prime factor Time Limit: 5000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Oth ...
- [mysql]数据库引擎查看
1.查看数据库引擎 全局下,show engines; 2.察看数据库引擎 show variables like '%engine%'; 或者show create table xxx\G 会显示默 ...
- ACM1598并查集方法
find the most comfortable road Problem Description XX星有许多城市,城市之间通过一种奇怪的高速公路SARS(Super Air Roam Struc ...
- HDU3336 KMP+DP
Count the string Time Limit: 2000/1000 MS (Java/Others) Memory Limit: 32768/32768 K (Java/Others) ...
- 问题03.如果有多个集合的迭代处理情况【使用MAP】
在SQL开发过程中,动态构建In集合条件查询是比较常见的用法,在Mybatis中提供了foreach功能,该功能比较强大,它允许你指定一个集合,声明集合项和索引变量,它们可以用在元素体内.它也允许你指 ...
- 网络编程:I/O模型
I/O模型 Unix下可用的5种I/O模型有: 阻塞式I/O 非阻塞式I/O I/O复用(select和poll,epoll) 信号驱动式I/O 异步I/O(POSIX的aio_系列函数) 一个输入操 ...
- [Luogu 2604] ZJOI2010 网络扩容
[Luogu 2604] ZJOI2010 网络扩容 第一问直接最大流. 第二问,添加一遍带费用的边,边权 INF,超级源点连源点一条容量为 \(k\) 的边来限流,跑费用流. 大约是第一次用 nam ...
- Maven -- 在进行war打包时用正式环境的配置覆盖开发环境的配置
我们的配置文件一般都放在 src/main/resource 目录下. 假定我们的正式环境配置放在 src/main/online-resource 目录下. 那么打成war包时,我们希望用onli ...
- HDU 1070 Milk (模拟)
题目链接 Problem Description Ignatius drinks milk everyday, now he is in the supermarket and he wants to ...
- 大聊Python-----网络编程
什么是Socket? socket本质上就是在2台网络互通的电脑之间,架设一个通道,两台电脑通过这个通道来实现数据的互相传递. 我们知道网络 通信 都 是基于 ip+port 方能定位到目标的具体机器 ...