Crawler basics: the requests library
requests
The Python standard library offers urllib, urllib2, httplib, and similar modules for making HTTP requests, but their APIs are clumsy. They were built for another era and another internet, and even the simplest tasks demand a lot of work, sometimes including overriding various methods.

Requests is an Apache2-licensed HTTP library written in Python. It wraps the built-in modules in a much higher-level API, making network requests far more pleasant for Python programmers; with Requests you can easily accomplish just about anything a browser can.
1. GET requests
```python
# 1. GET without parameters
import requests

ret = requests.get('https://github.com/timeline.json')
print(ret.url)
print(ret.text)

# 2. GET with parameters
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)
print(ret.url)
print(ret.text)
```
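How the `params` dict becomes the query string can be checked without any network access. requests URL-encodes it much as the standard library's `urllib.parse.urlencode` does (a simplified sketch, not requests' exact implementation):

```python
from urllib.parse import urlencode

# The same payload as in the example above
payload = {'key1': 'value1', 'key2': 'value2'}
query = urlencode(payload)
url = "http://httpbin.org/get" + "?" + query
print(url)  # http://httpbin.org/get?key1=value1&key2=value2
```

This is why `ret.url` in the example above comes back with the parameters appended.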
2. POST requests
```python
# 1. Basic POST
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)
print(ret.text)

# 2. POST with custom headers and a JSON body
import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)
print(ret.text)
print(ret.cookies)
```
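The second example builds the JSON body by hand. What actually goes over the wire is easy to inspect locally (no network needed):

```python
import json

payload = {'some': 'data'}
body = json.dumps(payload)  # the string placed in the request body
print(body)                 # {"some": "data"}
print(type(body))           # <class 'str'>
```

Passing `json=payload` instead of `data=json.dumps(payload)` makes requests perform this serialization itself and set the Content-Type header for you, as the `param_json` example later in this post shows.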
3. Other request methods
```python
requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the above are built on top of this method:
requests.request(method, url, **kwargs)
```
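The claim that every verb helper is built on `requests.request` can be illustrated with `functools.partial`. This is a sketch of the pattern, not requests' actual source; the stand-in `request` function here is hypothetical:

```python
from functools import partial

def request(method, url, **kwargs):
    # Stand-in for requests.request: just report what would be sent.
    return f"{method.upper()} {url}"

# Each verb helper simply fixes the method argument.
get = partial(request, 'get')
post = partial(request, 'post')

print(get('http://httpbin.org/get'))    # GET http://httpbin.org/get
print(post('http://httpbin.org/post'))  # POST http://httpbin.org/post
```

In requests itself, each helper is a thin function that forwards to `request()` with the method name filled in, which is why every keyword argument listed in the next section works with all of them.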
4. More parameters

```python
def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

        >>> import requests
        >>> req = requests.request('GET', 'http://httpbin.org/get')
        <Response [200]>
    """
```


```python
import requests


def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # params can be:
    # - a dict
    # - a string
    # - bytes (ASCII characters only)

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params="k1=v1&k2=水电费&k3=v3&k3=vv3")

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # Wrong -- bytes params must stay within ASCII:
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
    pass


def param_data():
    # data can be a dict, a string, bytes, or a file object

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1; k2=v2; k3=v3; k3=v4")

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1;k2=v2;k3=v3;k3=v4",
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file contains: k1=v1;k2=v2;k3=v3;k3=v4
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'})
    pass


def param_json():
    # The json argument is serialized with json.dumps(...) and sent in the
    # request body, with the header set to {'Content-Type': 'application/json'}.
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})


def param_headers():
    # Send custom request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'})


def param_cookies():
    # Send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'})

    # A CookieJar also works (the dict form is a wrapper built on top of it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False))
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # Upload a file
    # file_dict = {
    #     'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload a file with a custom filename
    # file_dict = {
    #     'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload in-memory content with a custom filename
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload with a custom filename, content type, and extra headers
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)
    pass


def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    #                    auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)


def param_timeout():
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass


def param_allow_redirects():
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies = {
    #     "http": "61.172.249.96:80",
    #     "https": "http://61.185.219.126:3128",
    # }
    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}
    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)

    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    #     'http': '77.75.105.165',
    #     'https': '77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)
    pass


def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # Process the response here.
    #     for i in r.iter_content():
    #         print(i)


def requests_session():
    import requests

    session = requests.Session()

    # 1. Visit any page first to obtain a cookie
    i1 = session.get(url="http://dig.chouti.com/help/service")

    # 2. Log in, carrying the cookie from the previous request; the backend
    #    authorizes the 'gpsd' value inside it
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "8615131255089",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)
```

Notes: among the request parameters, `files={'f1': open('xxx', 'rb')}` uploads a file, and the `auth` parameter handles basic authentication.

The `timeout` parameter sets a timeout in seconds. `timeout=2` means the request is abandoned if a connection cannot be established within 2 seconds. It can also be a pair, `timeout=(3, 2)`: the first value is the connect timeout, the second is the read (response) timeout.
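Both timeout forms reduce to a (connect, read) pair. A small hypothetical helper (not part of requests) makes the semantics explicit:

```python
def normalize_timeout(timeout):
    # A single number applies to both the connect and the read phase;
    # a 2-tuple specifies (connect timeout, read timeout) separately.
    if isinstance(timeout, tuple):
        connect, read = timeout
    else:
        connect = read = timeout
    return connect, read

print(normalize_timeout(2))       # (2, 2)
print(normalize_timeout((3, 2)))  # (3, 2)
```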
The `proxies` parameter sets proxy IPs. `proxies = {"http": "61.172.249.96:80", "https": "http://61.185.219.126:3128"}` means plain-HTTP requests go through the first proxy and HTTPS requests through the second.

proxies can also be keyed per site: `proxies = {'http://target-site': 'http://10.10.1.10:5323'}` means requests to that specific site use the given proxy.
If the proxy requires authentication, you can do it this way:

```python
from requests.auth import HTTPProxyAuth

proxyDict = {'http': '77.75.105.165', 'https': '77.75.105.165'}  # proxy IPs
auth = HTTPProxyAuth('username', 'mypassword')

r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
```
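Which proxy entry a given URL picks up can be sketched with a scheme lookup. This is a simplified model of the idea (the `pick_proxy` helper is hypothetical; requests' real resolution also honors environment variables and no_proxy rules):

```python
from urllib.parse import urlparse

def pick_proxy(url, proxies):
    # Prefer a site-specific key, then fall back to the URL scheme.
    parsed = urlparse(url)
    host_key = f"{parsed.scheme}://{parsed.hostname}"
    if host_key in proxies:
        return proxies[host_key]
    return proxies.get(parsed.scheme)

proxies = {
    "http": "http://61.172.249.96:80",
    "https": "http://61.185.219.126:3128",
    "http://10.20.1.128": "http://10.10.1.10:5323",
}
print(pick_proxy("http://10.20.1.128/x", proxies))  # http://10.10.1.10:5323
print(pick_proxy("https://example.com/", proxies))  # http://61.185.219.126:3128
```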
The `cert` parameter specifies a client certificate (a .pem file); when given as a tuple, the certificate and its key are supplied separately. It is used together with the `verify` parameter, which controls certificate verification.

Note that any URL beginning with https involves certificates.
Each request should carry the cookies from earlier requests. The requests module wraps this up in Session, which manages cookies and headers for us: every request automatically carries the cookies accumulated so far, and when a response comes back, any new cookies are merged in.

Usage: `session = requests.Session()`, then `session.get(...)`.

Official docs: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4
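The cookie bookkeeping Session performs can be modeled in a few lines. This `ToySession` is a toy stand-in, not the real `requests.Session` (which also persists headers, auth, and connection pools):

```python
class ToySession:
    """Minimal model of cookie persistence across requests."""

    def __init__(self):
        self.cookies = {}

    def request(self, url, response_cookies=None):
        # Everything stored so far is carried on this request.
        sent = dict(self.cookies)
        # Merge in any cookies the (simulated) response set.
        if response_cookies:
            self.cookies.update(response_cookies)
        return sent

s = ToySession()
# First request: server sets a cookie (simulated here).
s.request("http://dig.chouti.com/help/service", response_cookies={"gpsd": "abc123"})
# Second request automatically carries it.
sent = s.request("http://dig.chouti.com/login")
print(sent)  # {'gpsd': 'abc123'}
```

This is exactly the behavior the chouti login example above relies on: the cookie obtained by `i1` is carried by `i2` and `i3` without any manual work.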