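Fixing garbled Chinese pages in the Python HTTP library requests
I have written several crawlers recently and got familiar with Python's requests library, which is far more pleasant to use than Python's official (standard-library) API. One drawback: the library does not seem to handle Chinese very well. Some pages come back garbled, and the problem disappears when I switch to urllib. Because requests ultimately uses urllib3 as its underlying transport adapter and only post-processes the raw data that urllib3 reads into a friendlier form, the problem must lie in requests itself. So I decided to read the library's source and fix the garbled-Chinese problem; I also hoped this would deepen my understanding of the HTTP protocol and of Python.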

    >>> req = requests.get('')
    >>> req
    <Response [200]>
    >>> print req.text[:100]
    FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc, LINE: 770 <==> ISO-8859-1
    FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc, LINE: 781 <==> ISO-8859-1
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="gbk" />
    <title>¾©¶«(JD.COM)-×ÛºÏÍø¹ºÊ×Ñ¡-ÕýÆ·µÍ¼Û¡¢Æ·ÖÊ
    # The title is garbled here.
    >>> dir(req)
    ['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getstate__', '__hash__', '__init__', '__iter__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']


    >>> print req.content[:100]
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="gbk" />
    <title>¾©¶«(JD.COM)-؛ºЍ닗ѡ-ֽƷµͼۡ¢Ʒ
    >>> print req.content.decode('gbk')[:100]
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="gbk" />
    <title>京东(JD.COM)-综合网购首选-正品低价、品质保障、配送及时、轻松购物!</
    # The page is GBK-encoded while the Linux terminal uses UTF-8, so printing the
    # raw bytes is bound to produce garbage. Decoding them first displays the page correctly.
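
The garbling seen in this session can be reproduced without any network access. The following is a minimal Python 3 sketch (the transcript itself is Python 2.7); the sample string and the variable names are illustrative assumptions, not part of the original session.

```python
# Reproduce the mojibake: GBK bytes decoded with the wrong codec.
title = "京东(JD.COM)"

raw = title.encode("gbk")          # the bytes a GBK page would send
wrong = raw.decode("iso-8859-1")   # decoding with ISO-8859-1, as req.text did
right = raw.decode("gbk")          # decoding with the page's real encoding

print(wrong)   # ¾©¶«(JD.COM)
print(right)   # 京东(JD.COM)
```

The first line of output matches the start of the garbled title in the transcript, which suggests the bytes were fine and only the codec was wrong.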


    >>> print req.text[:100]
    FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc, LINE: 770 <==> ISO-8859-1
    FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc, LINE: 781 <==> ISO-8859-1
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="gbk" />
    <title>¾©¶«(JD.COM)-×ÛºÏÍø¹ºÊ×Ñ¡-ÕýÆ·µÍ¼Û¡¢Æ·ÖÊ
    >>> print req.text.decode('gbk')[:100]
    FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc, LINE: 770 <==> ISO-8859-1
    FILE: /usr/lib/python2.7/dist-packages/requests/models.pyc, LINE: 781 <==> ISO-8859-1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 60-63: ordinal not in range(128)
    # Decoding the text attribute raises an error: text is already unicode,
    # so Python 2 first tries to encode it as ASCII.
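
If only the wrongly decoded text is at hand, it can still be repaired, because ISO-8859-1 maps every byte to exactly one code point, so encoding back to bytes loses nothing. A hedged Python 3 sketch; `redecode` is a hypothetical helper name, not a requests function:

```python
def redecode(text, wrong="iso-8859-1", actual="gbk"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return text.encode(wrong).decode(actual)


# Simulate what a wrongly configured decode would have produced.
garbled = "京东".encode("gbk").decode("iso-8859-1")

print(garbled)            # ¾©¶«
print(redecode(garbled))  # 京东
```

This only works when the wrong codec is lossless for the bytes involved, as ISO-8859-1 is; it is a workaround, not a replacement for decoding content correctly in the first place.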


    # /requests/
    @property
    def content(self):
        """Content of the response, in bytes."""
        if self._content is False:
            # Read the contents.
            try:
                if self._content_consumed:
                    raise RuntimeError(
                        'The content for this response was already consumed')
                if self.status_code == 0:
                    self._content = None
                else:
                    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
            except AttributeError:
                self._content = None
        self._content_consumed = True
        # don't need to release the connection; that's been handled by urllib3
        # since we exhausted the data.
        return self._content

    # requests/
    @property
    def text(self):
        """Content of the response, in unicode.
        If Response.encoding is None, encoding will be guessed using
        ``chardet``.
        The encoding of the response content is determined based solely on HTTP
        headers, following RFC 2616 to the letter. If you can take advantage of
        non-HTTP knowledge to make a better guess at the encoding, you should
        set ``r.encoding`` appropriately before accessing this property.
        """
        # Try charset from content-type
        content = None
        encoding = self.encoding
        if not self.content:
            return str('')
        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding
        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # A TypeError can be raised if encoding is None
            #
            # So we try blindly encoding.
            content = str(self.content, errors='replace')
        return content
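
Stripped of the response object, the decoding steps of text come down to a few lines. The following is a simplified Python 3 sketch, not the library's code: `decode_body` and its `guess` parameter are hypothetical names, and `guess` merely stands in for the chardet-based apparent_encoding.

```python
def decode_body(content, encoding=None, guess="utf-8"):
    """Decode raw bytes roughly the way the text property does."""
    if not content:
        return ""
    if encoding is None:
        # requests asks chardet at this point; this sketch takes a caller-supplied guess.
        encoding = guess
    try:
        return str(content, encoding, errors="replace")
    except LookupError:
        # Unknown codec name: fall back to Python's default decoding (UTF-8).
        return str(content, errors="replace")


body = "京东".encode("gbk")
print(decode_body(body, "gbk"))         # 京东
print(decode_body(body, "ISO-8859-1"))  # ¾©¶«
```

The point of the sketch is that the function never inspects the bytes when an encoding is supplied: whatever name it is handed decides the result.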

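From the docstrings and the source we can see that content is the raw bytes read back by urllib3, while text is nothing more than an attempt to decode content into unicode using encoding. The page is GBK-encoded, so the problem must be in which encoding gets used: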
    >>> req.apparent_encoding; req.encoding
    'GB2312'
    'ISO-8859-1'

    # requests/
    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library"""
        return chardet.detect(self.content)['encoding']

    # requests/
    def build_response(self, req, resp):
        """Builds a :class:`Response <requests.Response>` object from a urllib3
        response. This should not be called from user code, and is only exposed
        for use when subclassing the
        :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`
        :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
        :param resp: The urllib3 response object.
        """
        response = Response()
        # Fallback to None if there's no status_code, for whatever reason.
        response.status_code = getattr(resp, 'status', None)
        # Make headers case-insensitive.
        response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {}))
        # Set encoding.
        response.encoding = get_encoding_from_headers(response.headers)
        # .......


    # requests/
    def get_encoding_from_headers(headers):
        """Returns encodings from given HTTP Header Dict.
        :param headers: dictionary to extract encoding from.
        """
        content_type = headers.get('content-type')
        if not content_type:
            return None
        content_type, params = cgi.parse_header(content_type)
        if 'charset' in params:
            return params['charset'].strip("'\"")
        if 'text' in content_type:
            return 'ISO-8859-1'
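
To see the fallback in isolation, here is a hedged Python 3 re-creation of the same logic. It is not the library's code: the cgi module used in the excerpt has been removed from recent Python versions, so this sketch parses the header with the standard library's `email.message.Message` instead, and `encoding_from_content_type` is a hypothetical name.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Mimic the header-only charset lookup of get_encoding_from_headers."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    if "text" in msg.get_content_type():
        # No charset given for a text type: the HTTP default is assumed.
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html; charset=gbk"))  # gbk
print(encoding_from_content_type("text/html"))               # ISO-8859-1
print(encoding_from_content_type("application/json"))        # None
```

The second call is the case this article runs into: a text type with no charset parameter.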

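See it? The program takes the encoding only from the HTTP response headers. If the Content-Type header is a text type and does not specify a charset, the function simply returns 'ISO-8859-1'.
For this page, the response header specifies only the type and no charset (nowadays the page encoding is usually declared inside the HTML itself), so the function returns 'ISO-8859-1'.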
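There are two ways to fix this:
1. Modify the get_encoding_from_headers function to detect the page encoding with a regular expression. Most pages now declare their charset in the HTML source, so the encoding found by the regex is correct in most cases.
2. Since content is the raw byte string of the HTTP response, use it directly: decode content into unicode with the page's own encoding.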
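
Both ideas can be combined in a few lines. The following is a minimal Python 3 sketch under stated assumptions: `META_CHARSET`, `detect_meta_charset` and `decode_page` are hypothetical names, the regex only covers the common `<meta ... charset=...>` forms, and only the first 4096 bytes are searched on the assumption that the declaration sits near the top of the page.

```python
import re

# Matches <meta charset="gbk"> and <meta ... content="text/html; charset=gbk">.
META_CHARSET = re.compile(rb"""<meta[^>]*?charset=["']?([\w-]+)""", re.IGNORECASE)


def detect_meta_charset(content):
    """Return the charset declared in the HTML bytes, or None if none is found."""
    match = META_CHARSET.search(content[:4096])
    return match.group(1).decode("ascii") if match else None


def decode_page(content, default="utf-8"):
    """Decode raw response bytes using the charset the page itself declares."""
    encoding = detect_meta_charset(content) or default
    try:
        return content.decode(encoding, errors="replace")
    except LookupError:
        # The page declared a codec name Python does not know.
        return content.decode(default, errors="replace")


page = '<meta charset="gbk" /><title>京东(JD.COM)</title>'.encode("gbk")
print(detect_meta_charset(page))  # gbk
print(decode_page(page))
```

With a real response object, the bytes would come from content, and the docstring of text quoted earlier suggests the alternative of assigning the detected value to `r.encoding` before reading `r.text`.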
