Python Web Scraping [requests]: HTTP for Humans

Installation

 pip install requests

Source code

git clone https://github.com/kennethreitz/requests.git

Import

import requests

Sending requests

GET request

r = requests.get('https://api.github.com/events')

POST request

r = requests.post('http://httpbin.org/post', data = {'key':'value'})

Other HTTP methods

>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})

>>> r = requests.delete('http://httpbin.org/delete')

>>> r = requests.head('http://httpbin.org/get')

>>> r = requests.options('http://httpbin.org/get')

Passing URL parameters

1. GET request with parameters

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.get("http://httpbin.org/get", params=payload)

>>> print(r.url)

http://httpbin.org/get?key1=value1&key2=value2

Passing a list as a parameter value

>>> payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

>>> r = requests.get('http://httpbin.org/get', params=payload)

>>> print(r.url)

http://httpbin.org/get?key1=value1&key2=value2&key2=value3
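
The URL encoding can be inspected without sending anything. The sketch below prepares the requests locally instead of calling requests.get, which is a choice made here so the example runs offline; the variable names prepared and sparse are illustrative.

```python
import requests

# Build the requests without sending them, so no network access is needed.
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(prepared.url)

# A key whose value is None is left out of the query string.
sparse = requests.Request('GET', 'http://httpbin.org/get',
                          params={'key1': 'value1', 'key2': None}).prepare()
print(sparse.url)
```

Each item of a list value becomes its own key=value pair, and None values are dropped rather than sent as empty strings.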

2. POST request

To send parameters in the request body, use the data argument, which accepts a dict, a string, or a file-like object.

A dict is sent as form-encoded data:

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.post("http://httpbin.org/post", data=payload)

>>> print(r.text)

{

  ...

  "form": {

    "key2": "value2",

    "key1": "value1"

  },

  ...

}

To send a JSON body (application/json), pass a string instead of a dict; the string is sent as-is:

>>> import json

>>> url = 'https://api.github.com/some/endpoint'

>>> payload = {'some': 'data'}

>>> r = requests.post(url, data=json.dumps(payload))
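
Passing data=json.dumps(payload) sends the string as-is and does not set a Content-Type header. Requests also accepts a json argument that serializes the payload and sets the header. A minimal sketch comparing the two, prepared locally so nothing is actually sent (the names manual and auto are illustrative):

```python
import json
import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# Serialized by hand: the body is sent as-is, with no Content-Type header.
manual = requests.Request('POST', url, data=json.dumps(payload)).prepare()

# Serialized by Requests: the Content-Type header is set to application/json.
auto = requests.Request('POST', url, json=payload).prepare()

print(manual.headers.get('Content-Type'))
print(auto.headers.get('Content-Type'))
print(auto.body)
```

With a real endpoint the equivalent call is requests.post(url, json=payload).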

Streaming uploads

with open('massive-body', 'rb') as f:

    requests.post('http://some.url/streamed', data=f)

Chunk-encoded requests

def gen():

    yield 'hi'

    yield 'there'

requests.post('http://some.url/chunked', data=gen())

To upload a file, use the files argument to send multipart-encoded data. files is a dict of the form {'name': file-like-object} (or {'name': ('filename', fileobj)}):

>>> url = 'http://httpbin.org/post'

>>> files = {'file': open('report.xls', 'rb')}

>>> r = requests.post(url, files=files)

>>> r.text

{

  ...

  "files": {

    "file": "<censored...binary...data>"

  },

  ...

}

You can also set the filename, content_type, and headers explicitly:

 >>> url = 'http://httpbin.org/post'

 >>> files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': ''})}

 >>> r = requests.post(url, files=files)

 >>> print(r.text)

 {

   "args": {},

   "data": "",

   "files": {

     "file": "1\t2\r\n"

   },

   "form": {},

   "headers": {

     "Content-Type": "multipart/form-data; boundary=e0f9ff1303b841498ae53a903f27e565",

     "Host": "httpbin.org",

     "User-Agent": "python-requests/2.2.1 CPython/2.7.3 Windows/7",

   },

   "url": "http://httpbin.org/post"

 }
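
The file-like object does not have to be a file on disk. As a rough sketch, the example below uploads in-memory bytes through io.BytesIO and prepares the request locally to show what would be sent; the file name report.csv and its contents are made up for illustration.

```python
import io
import requests

# An in-memory "file": (filename, file-like object, content type).
files = {'file': ('report.csv', io.BytesIO(b'a,b\n1,2\n'), 'text/csv')}

# Prepare the request without sending it, to inspect the multipart encoding.
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

print(prepared.headers['Content-Type'])  # multipart/form-data with a generated boundary
print(b'filename="report.csv"' in prepared.body)
```

When uploading real files, opening them in a with block ensures they are closed once the request has been sent.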

You can upload multiple files in one request, for example to a file field that accepts multiple values:

<input type="file" name="images" multiple="true" required="true"/>

Put the files in a list of tuples, where each tuple has the form (form_field_name, file_info):

>>> url = 'http://httpbin.org/post'

>>> multiple_files = [('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),

                      ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]

>>> r = requests.post(url, files=multiple_files)

>>> r.text

{

  ...

  'files': {'images': 'data:image/png;base64,iVBORw ....'}

  'Content-Type': 'multipart/form-data; boundary=3131623adb2043caaeb5538cc7aa0b3a',

  ...

}

Response content

>>> import requests

>>> r = requests.get('https://api.github.com/events')

>>> r.text

u'[{"repository":{"open_issues":0,"url":"https://github.com/...

Decoding

>>> r.encoding

'utf-8'

>>> r.encoding = 'ISO-8859-1'

In practice, the bytes are often decoded explicitly:

r.content.decode('utf-8')

Binary response content

For non-text responses, access the body as bytes:

>>> r.content

b'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests automatically decodes gzip and deflate transfer-encodings for you.

For example, to create an image from the binary data returned by a request:

>>> from PIL import Image

>>> from io import BytesIO

>>> i = Image.open(BytesIO(r.content))

JSON response content

Requests has a built-in JSON decoder:

>>> import requests

>>> r = requests.get('https://api.github.com/events')

>>> r.json()

[{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...

Raw response content

Make sure stream=True is set on the initial request:

>>> r = requests.get('https://api.github.com/events', stream=True)

>>> r.raw

<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>

>>> r.raw.read(10)

'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, save the stream to a file with a pattern like this:

with open(filename, 'wb') as fd:

    for chunk in r.iter_content(chunk_size):

        fd.write(chunk)
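
The loop above leaves filename and chunk_size undefined. A minimal sketch of a complete helper follows; save_stream and its default chunk size of 8192 bytes are names and values chosen here for illustration, not part of Requests.

```python
def save_stream(response, path, chunk_size=8192):
    """Write a streamed response body to `path` and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as fd:
        # iter_content yields the body in pieces of at most chunk_size bytes.
        for chunk in response.iter_content(chunk_size=chunk_size):
            fd.write(chunk)
            written += len(chunk)
    return written
```

A typical call would open the response with requests.get(url, stream=True) inside a with statement, check it with raise_for_status(), and then pass it to save_stream.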

Custom headers

Just pass a dict to the headers parameter:

>>> url = 'https://api.github.com/some/endpoint'

>>> headers = {'user-agent': 'my-app/0.0.1'}

>>> r = requests.get(url, headers=headers)

Response status codes

>>> r = requests.get('http://httpbin.org/get')

>>> r.status_code

200

Requests also ships a built-in status code lookup object for readable comparisons:

>>> r.status_code == requests.codes.ok

True

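If the request was bad (a 4XX client error or 5XX server error response), raise_for_status() raises an exception. When the status code is 200, it returns None.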

>>> bad_r = requests.get('http://httpbin.org/status/404')

>>> bad_r.status_code

404

>>> bad_r.raise_for_status()

Traceback (most recent call last):

  File "requests/models.py", line 832, in raise_for_status

    raise http_error

requests.exceptions.HTTPError: 404 Client Error
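
The checks above can be tried without a server. In the sketch below a Response object is built by hand purely so the example runs offline; in real code the object comes back from requests.get or a similar call.

```python
import requests

# Built by hand only for illustration; normally returned by requests.get(...).
r = requests.Response()
r.status_code = 404

# r.ok is False for 4XX and 5XX status codes.
print(r.ok)

caught = None
try:
    r.raise_for_status()
except requests.exceptions.HTTPError as exc:
    caught = exc
    # The exception carries the response that triggered it.
    print(exc.response.status_code)

# The lookup object maps readable names to numeric codes.
print(requests.codes.not_found)
```

Use r.ok for a quick boolean check, and raise_for_status() when a failure should stop the current code path.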

Response headers

The server's response headers are available as a Python dict-like object:

>>> r.headers

{

    'content-encoding': 'gzip',

    'transfer-encoding': 'chunked',

    'connection': 'close',

    'server': 'nginx/1.0.4',

    'x-runtime': '148ms',

    'etag': '"e1ca502697e5c9317743dc078f67693f"',

    'content-type': 'application/json'

}

Because HTTP header names are case-insensitive, any capitalization works:

>>> r.headers['Content-Type']

'application/json'

>>> r.headers.get('content-type')

'application/json'

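Headers are special in one more way: a server may send the same header several times with different values. Requests combines them into a single entry, appending each subsequent field value to the combined value in order, separated by commas.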
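
The case-insensitive lookup comes from the CaseInsensitiveDict class in requests.structures, which can be used on its own. A small sketch (the header value is just sample data):

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({'Content-Type': 'application/json'})

# Lookups ignore case...
print(headers['content-type'])
print(headers.get('CONTENT-TYPE'))

# ...but iteration keeps the capitalization that was originally set.
print(list(headers))
```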

Cookie

>>> url = 'http://example.com/some/cookie/setting/url'

>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']

'example_cookie_value'

Sending cookies with a request

>>> url = 'http://httpbin.org/cookies'

>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)

>>> r.text

'{"cookies": {"cookies_are": "working"}}'

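Cookies are returned in a RequestsCookieJar, which acts like a dict but offers a more complete interface suitable for use over multiple domains or paths. Cookie jars can also be passed in to requests: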

>>> jar = requests.cookies.RequestsCookieJar()

>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')

>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

>>> url = 'http://httpbin.org/cookies'

>>> r = requests.get(url, cookies=jar)

>>> r.text

'{"cookies": {"tasty_cookie": "yum"}}'

Converting a CookieJar to a dict

>>> requests.utils.dict_from_cookiejar(r.cookies)

{'BAIDUID': '84722199DF8EDC372D549EC56CA1A0E2:FG=1', 'BD_HOME': '', 'BDSVRTM': ''}

Converting a dict to a CookieJar:

requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
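
A short round trip through both helpers, as a sketch; the cookie names and values are reused from the earlier examples, and no request is sent.

```python
import requests

# dict -> RequestsCookieJar
jar = requests.utils.cookiejar_from_dict({'tasty_cookie': 'yum'})

# The jar can also hold cookies scoped to a domain and path.
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# RequestsCookieJar -> dict (domain and path information is dropped)
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)

# Look up a cookie by name, narrowed to a domain and path.
print(jar.get('gross_cookie', domain='httpbin.org', path='/elsewhere'))
```

The dict form is convenient for logging or saving cookies, but it cannot distinguish two cookies that share a name across different domains or paths.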

Session objects

Requests provides a Session class that persists cookies across requests, which is useful for accessing pages that require a login.

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

r = s.get("http://httpbin.org/cookies")

print(r.text)

# '{"cookies": {"sessioncookie": "123456789"}}'

Sessions can also provide default data to the request methods. This is done by setting properties on the Session object:

s = requests.Session()

s.auth = ('user', 'pass')

s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent

s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})

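Any dict you pass to a request method is merged with the session-level values that are set.

Method-level parameters override session parameters.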
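
The merge can be observed offline with Session.prepare_request, which applies the session defaults to a request without sending it. A minimal sketch, using the same header names as the example above:

```python
import requests

s = requests.Session()
s.auth = ('user', 'pass')
s.headers.update({'x-test': 'true'})
url = 'http://httpbin.org/headers'

# Session headers and method-level headers are merged.
merged = s.prepare_request(requests.Request('GET', url, headers={'x-test2': 'true'}))
print(merged.headers['x-test'], merged.headers['x-test2'])

# A method-level value overrides the session value for this request only.
overridden = s.prepare_request(requests.Request('GET', url, headers={'x-test': 'false'}))
print(overridden.headers['x-test'])

# Setting a key to None at method level omits the session value.
removed = s.prepare_request(requests.Request('GET', url, headers={'x-test': None}))
print('x-test' in removed.headers)
```

The session itself is unchanged by any of this: s.headers still holds 'x-test': 'true' afterwards.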

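Note, however, that method-level parameters are not persisted across requests, even when using a session. In the example below, the cookie is sent only with the first request, not the second: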

s = requests.Session()

r = s.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})

print(r.text)

# '{"cookies": {"from-my": "browser"}}'

r = s.get('http://httpbin.org/cookies')

print(r.text)

# '{"cookies": {}}'

Sessions as context managers

with requests.Session() as s:

    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

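SSL certificate verification

Requests verifies SSL certificates for HTTPS requests, just like a web browser. SSL verification is enabled by default, and Requests raises an SSLError if it cannot verify the certificate: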

>>> requests.get('https://requestb.in')

requests.exceptions.SSLError: hostname 'requestb.in' doesn't match either of '*.herokuapp.com', 'herokuapp.com'

I don't have SSL set up on that domain, so the request fails. GitHub does, though:

>>> requests.get('https://github.com', verify=True)

<Response [200]>

You can pass verify the path to a CA_BUNDLE file or to a directory containing certificates of trusted CAs:

>>> requests.get('https://github.com', verify='/path/to/certfile')

The setting can also be persisted on a session:

s = requests.Session()

s.verify = '/path/to/certfile'

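If verify is set to a directory path, the directory must have been processed with the c_rehash utility supplied with OpenSSL.

To skip certificate verification, set verify to False: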

>>> requests.get('https://kennethreitz.org', verify=False)

<Response [200]>

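By default, verify is set to True. The verify option applies only to host certificates.

Client-side certificates

Specify a local certificate to use as a client-side certificate, either as a single file (containing the private key and the certificate, in PEM format) or as a tuple of both files' paths: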

>>> requests.get('https://kennethreitz.org', cert=('/path/client.cert', '/path/client.key'))

Or persist it on a session:

s = requests.Session()

s.cert = '/path/client.cert'

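The private key of your local certificate must be unencrypted. Currently, Requests does not support encrypted keys.

If you specify a wrong path or an invalid certificate, you get an SSLError: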

>>> requests.get('https://kennethreitz.org', cert='/wrong_path/client.pem')

SSLError: [Errno 336265225] _ssl.c:347: error:140B0009:SSL routines:SSL_CTX_use_PrivateKey_file:PEM lib

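CA certificates

By default, Requests bundles a set of root CAs that it trusts, sourced from the Mozilla trust store. However, these are updated only once per Requests release. If you pin a version of Requests, your certificates can become badly out of date.

From Requests 2.4.0 onwards, Requests attempts to use the certificates from certifi if that package is present on the system. This lets users update their trusted certificates without changing their code.

For the sake of security, we recommend upgrading certifi frequently!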

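Body content workflow

By default, the response body is downloaded immediately when you make a request. You can override this behaviour with the stream parameter and defer downloading the body until you access the Response.content attribute: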

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'

r = requests.get(tarball_url, stream=True)

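At this point only the response headers have been downloaded and the connection remains open, which lets us make content retrieval conditional: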

if int(r.headers['content-length']) < TOO_LONG:

  content = r.content

  ...

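You can further control the workflow with the Response.iter_content and Response.iter_lines methods, or read the undecoded body from the underlying urllib3 urllib3.HTTPResponse object at Response.raw.

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can make connections inefficient. If you find yourself partially reading response bodies (or not reading them at all) while using stream=True, consider making the request within a with statement to ensure the connection is always closed: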

with requests.get('http://httpbin.org/get', stream=True) as r:

    # Do things with the response here.

    pass

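Event hooks

Requests has a hook system that you can use to manipulate portions of the request process or to handle signal events.

Available hooks:

response: the response generated from a request.

You can assign a hook function on a per-request basis by passing a {hook_name: callback_function} dict to the hooks request parameter: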

hooks=dict(response=print_url)

callback_function receives a chunk of data as its first argument.

def print_url(r, *args, **kwargs):

    print(r.url)

>>> requests.get('http://httpbin.org', hooks=dict(response=print_url))

http://httpbin.org

<Response [200]>
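
Hooks can also be registered once on a session so that they run for every response. The sketch below shows this under one loud assumption: StubAdapter is a made-up transport adapter that returns a canned response, mounted only so the example runs without network access. The host stub.test and the function record_url are likewise illustrative.

```python
import io
import requests
from requests.adapters import BaseAdapter


class StubAdapter(BaseAdapter):
    """Hypothetical adapter that answers every request with a canned 200 response."""

    def send(self, request, **kwargs):
        response = requests.Response()
        response.status_code = 200
        response.url = request.url
        response.request = request
        response.raw = io.BytesIO(b'ok')
        return response

    def close(self):
        pass


seen = []

def record_url(r, *args, **kwargs):
    # A response hook receives the Response as its first argument.
    seen.append(r.url)


with requests.Session() as s:
    s.mount('http://stub.test', StubAdapter())  # route this host to the stub
    s.hooks['response'].append(record_url)      # session-level hook
    r = s.get('http://stub.test/a')

print(seen)
```

Without the stub, the same s.hooks['response'].append(...) line works against real servers.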

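Custom authentication

Any callable passed as the auth argument to a request method has the opportunity to modify the request before it is dispatched.

Authentication implementations are subclasses of requests.auth.AuthBase. Requests provides two common authentication scheme implementations in requests.auth: HTTPBasicAuth and HTTPDigestAuth.

Suppose we have a web service that responds only if the X-Pizza header is set to a password value: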

from requests.auth import AuthBase

class PizzaAuth(AuthBase):

    """Attaches HTTP Pizza Authentication to the given Request object."""

    def __init__(self, username):

        # setup any auth-related data here

        self.username = username

    def __call__(self, r):

        # modify and return the request

        r.headers['X-Pizza'] = self.username

        return r

>>> requests.get('http://pizzabin.org/admin', auth=PizzaAuth('kenneth'))

<Response [200]>
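
The same pattern covers token-based schemes. As a sketch, BearerAuth below is a hypothetical class (not shipped with Requests) that adds an Authorization header; the token abc123 is a placeholder, and the request is prepared locally rather than sent.

```python
import requests
from requests.auth import AuthBase


class BearerAuth(AuthBase):
    """Hypothetical auth class that attaches a bearer token to the request."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        # Modify the outgoing request and return it.
        r.headers['Authorization'] = f'Bearer {self.token}'
        return r


prepared = requests.Request(
    'GET', 'http://httpbin.org/headers', auth=BearerAuth('abc123')
).prepare()
print(prepared.headers['Authorization'])
```

Against a real service the object is passed the same way: requests.get(url, auth=BearerAuth(token)).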

Streaming requests

Simply set stream to True and iterate over the response with iter_lines:

import json

import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():

    # filter out keep-alive new lines

    if line:

        decoded_line = line.decode('utf-8')

        print(json.loads(decoded_line))

When using decode_unicode=True with Response.iter_lines() or Response.iter_content(), provide a fallback encoding in case the server does not supply one; otherwise an error can occur:

r = requests.get('http://httpbin.org/stream/20', stream=True)

if r.encoding is None:

    r.encoding = 'utf-8'

for line in r.iter_lines(decode_unicode=True):

    if line:

        print(json.loads(line))

Proxies

Configure proxies for an individual request with the proxies argument:

import requests

proxies = {

  "http": "http://10.10.1.10:3128",

  "https": "http://10.10.1.10:1080",

}

requests.get("http://example.org", proxies=proxies)

You can also configure proxies through the HTTP_PROXY and HTTPS_PROXY environment variables:

$ export HTTP_PROXY="http://10.10.1.10:3128"

$ export HTTPS_PROXY="http://10.10.1.10:1080"

$ python

>>> import requests

>>> requests.get("http://example.org")

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax:

proxies = {

    "http": "http://user:pass@10.10.1.10:3128/",

}

To set a proxy for a specific scheme and host, use scheme://hostname as the key. It matches any request to the given scheme and exact hostname.

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

Proxy URLs must include the scheme.
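
To see which entry applies to a given URL, the select_proxy helper in requests.utils can be called directly. It is a utility that Requests uses internally, so treat relying on it as an assumption about the installed version; the addresses below are the sample ones from this section.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
    'http://10.20.1.128': 'http://10.10.1.10:5323',
}

# A scheme://hostname key takes precedence over the plain scheme key.
specific = select_proxy('http://10.20.1.128/path', proxies)

# Other hosts fall back to the entry for their scheme.
plain_http = select_proxy('http://example.org/', proxies)
plain_https = select_proxy('https://example.org/', proxies)

print(specific, plain_http, plain_https)
```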

SOCKS proxies

Installation

 pip install requests[socks]

proxies = {

    'http': 'socks5://user:pass@host:port',

    'https': 'socks5://user:pass@host:port'

}

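Redirection and history

By default, Requests automatically follows redirects for all verbs except HEAD.

Use the history attribute of the Response object to track redirection.

Response.history is a list of the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.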

>>> r = requests.get('http://github.com')

>>> r.url

'https://github.com/'

>>> r.status_code

200

>>> r.history

[<Response [301]>]

If you are using GET, OPTIONS, POST, PUT, PATCH, or DELETE, you can disable redirect handling with the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)

>>> r.status_code

301

>>> r.history

[]

If you are using HEAD, you can enable redirection as well:

>>> r = requests.head('http://github.com', allow_redirects=True)

>>> r.url

'https://github.com/'

>>> r.history

[<Response [301]>]

Timeouts

Tell Requests to stop waiting for a response after a given number of seconds with the timeout parameter:

>>> requests.get('http://github.com', timeout=0.001)

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

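timeout is not a time limit on the entire response download. Rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.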

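Errors and exceptions

In the event of a network problem (e.g. DNS failure, refused connection), Requests raises a ConnectionError exception.

Response.raise_for_status() raises an HTTPError if the HTTP request returned an unsuccessful status code.

If a request times out, a Timeout exception is raised.

If a request exceeds the configured maximum number of redirections, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.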
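
One way to put this hierarchy to use is to catch the specific exceptions first and fall back to RequestException last. The helper below is only a sketch: describe_request, its return strings, and the default timeout of (3.05, 27) seconds for connect and read are choices made here for illustration.

```python
import requests


def describe_request(url, timeout=(3.05, 27)):
    """Fetch `url` and return a short description of the outcome."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
    except requests.exceptions.Timeout:
        return 'timed out'
    except requests.exceptions.TooManyRedirects:
        return 'too many redirects'
    except requests.exceptions.HTTPError as exc:
        return f'bad status: {exc.response.status_code}'
    except requests.exceptions.ConnectionError:
        return 'network problem'
    except requests.exceptions.RequestException as exc:
        # Anything else Requests raises, e.g. a malformed URL.
        return f'request failed: {type(exc).__name__}'
    return f'ok: {r.status_code}'


# A URL without a scheme is rejected before any network traffic happens.
print(describe_request('no-scheme.example'))
```

Timeout is listed before ConnectionError because a connect timeout is an instance of both, and the first matching except clause wins.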