python urllib2库的简单总结

urllib2的简单介绍
参考网址：http://www.voidspace.org.uk/python/articles/urllib2.shtml

Fetching URLs
The simplest way to use urllib2 is as follows :
1、
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

2、
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
3、
req = urllib2.Request('ftp://example.com/')
4、
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
5、
Data can also be passed in an HTTP GET request by encoding it in the URL itself.

>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
6、
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
'location' : 'Northampton',
'language' : 'Python' }
headers = { 'User-Agent' : user_agent }

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

7、Handling Exceptions
1）URLError
>>> req = urllib2.Request('http://www.jianshu.com/p/5c7a1af4aa531')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>> print e.reason
>>> print e,e.code #分别表示凡返回错误类型，错误代码和类型，错误代码

2）HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
HTTPError
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request
3）Error Codes
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
'Switching to new protocol; obey Upgrade header'),

200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),

300: ('Multiple Choices',
'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
'Document has not changed since given time'),
305: ('Use Proxy',
'You must use proxy specified in Location to access this '
'resource.'),
307: ('Temporary Redirect',
'Object moved temporarily -- see URI list'),

400: ('Bad Request',
'Bad request syntax or unsupported method'),
401: ('Unauthorized',
'No permission -- see authorization schemes'),
402: ('Payment Required',
'No payment -- see charging schemes'),
403: ('Forbidden',
'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
'Cannot satisfy request range.'),
417: ('Expectation Failed',
'Expect condition could not be satisfied.'),

500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
4）
例子1：
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
response = urlopen(req)
except HTTPError, e:
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
except URLError, e:
print 'We failed to reach a server.'
print 'Reason: ', e.reason
else:

注意：HTTPError 是URLError的子类，要写在前面
例子2：
from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
response = urlopen(req)
except URLError, e:
if hasattr(e, 'reason'):
print 'We failed to reach a server.'
print 'Reason: ', e.reason
elif hasattr(e, 'code'):
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
else:
# everything is fine

例子3：
from urllib2 import Request, urlopen
req = Request(someurl)
try:
response = urlopen(req)
except IOError, e:
if hasattr(e, 'reason'):
print 'We failed to reach a server.'
print 'Reason: ', e.reason
elif hasattr(e, 'code'):
print 'The server couldn\'t fulfill the request.'
print 'Error code: ', e.code
else:
# everything is fine
注意：URLError是IOError的子类，极少数情况下可能会报socket.error

8、info and geturl
geturl:
这将返回所获取页面的真实URL。这很有用，因为urlopen(或者使用的打开器对象)可能已经遵循了重定向。所获取的页面的URL可能与所请求的URL不相同
info:
这会返回一个类似于字典的对象，描述所获取的页面，特别是服务器发送的头部。它现在是一个httplib。HTTPMessage实例.
例子：
from urllib2 import Request,urlopen,URLError,HTTPError
url = 'https://passport.baidu.com/center?_t=1510744860'
req = Request(url)
response = urlopen(req)
print response.info()
print response.geturl()

9、Openers and Handlers
Openers：

　　当你获取一个URL你使用一个opener(一个urllib2.OpenerDirector的实例)。正常情况下，我们使用默认opener：通过urlopen。但你能够创建个性的openers。可以用build_opener来创建opener对象。一般可用于需要处理cookie或者不想进行redirection的应用场景（You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.）

　　以下是用代理ip模拟登录时（需要处理cookie）使用handler和opener的具体流程。

1 self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
2 self.cookie = cookielib.LWPCookieJar()
3 self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
4 self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
Handles：

　　Openers使用处理器handlers，所有的“繁重”工作由handlers处理。每个handlers知道如何通过特定协议打开URLs，或者如何处理URL打开时的各个方面。例如HTTP重定向或者HTTP cookies。

更多关于Openers和Handlers的信息。http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers

10、Proxies
proxy代理ip创建opener

Note：Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
（http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies）

例子：
1 import urllib2
2 proxy——handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})#注意要确保该代理ip可用
3 opener = urllib2.build_opener(proxy_handler)
4 request = urllib2.Request(url, post_data, login_headers)#该例中还需要提交post_data和header信息
5 response = opener.open(request)
6 print response.read().encode('utf-8')

11、Sockets and Layers
例子：
import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)

12、Cookie
urllib2 对 Cookie 的处理也是自动的。如果需要得到某个 Cookie 项的值，可以这么做：
例子：
import urllib2
import cookielib
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
print 'Name = '+item.name
print 'Value = '+item.value

运行之后就会输出访问百度的Cookie值：
Name = BAIDUID
Value = C664216C4F7BD6B98DB0B300292E0A23:FG=1
Name = BIDUPSID
Value = C664216C4F7BD6B98DB0B300292E0A23
Name = H_PS_PSSID
Value = 1464_21099_17001_24879_22159
Name = PSTM
Value = 1510747061
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
13、对付"反盗链"
某些站点有所谓的反盗链设置，其实说穿了很简单，
就是检查你发送请求的header里面，referer站点是不是他自己，
所以我们只需要像把headers的referer改成该网站即可，以baidu为例：
#...
headers = {
'Referer':'http://www.baidu.com/'
}
#...
headers是一个dict数据结构，你可以放入任何想要的header，来做一些伪装。
例如，有些网站喜欢读取header中的X-Forwarded-For来看看人家的真实IP，可以直接把X-Forwarde-For改了。

python urllib2库的简单总结的更多相关文章

python requests库的简单使用
requests是python的一个HTTP客户端库,跟urllib,urllib2类似,但比urllib,urllib2更加使用简单. 1. requests库的安装在你的终端中运行pip安装命令即 ...
python第三方库requests简单介绍
一.发送请求与传递参数简单demo: import requests r = requests.get(url='http://www.itwhy.org') # 最基本的GET请求 print(r ...
Python Pexpect库的简单使用
Python Pexpect库的使用简介最近需要远程操作一个服务器并执行该服务器上的一个python脚本,查到可以使用Pexpect这个库.记录一下. 什么是Pexpect?Pexpect能够产生 ...
Python turtle库绘制简单图形
一.简介 Python中的turtle库是一个直观有趣的图形绘制函数库.turtle库绘制图形有一个基本框架:一个小海龟在坐标系中爬行,其爬行轨迹形成了绘制图形. 二.简单的图形列举 1.绘制4个不同 ...
python requests库的简单运用
python requests的简单运用使用pycharm获取requests包 ctrl+alt+s Project:pythonProject pythoninterpreter 点+号搜索使 ...
python urllib2练习发送简单post
import urllib2 import urllib url = 'http://localhost/1.php' while True: data = raw_input('(ctrl+c ex ...
python第三方库，你要的这里都有
Python的第三方库多的超出我的想象. python 第三方模块转 https://github.com/masterpy/zwpy_lst Chardet,字符编码探测器,可以自动检测文本. ...
python常用库（转）
转自http://www.west999.com/info/html/wangluobiancheng/qita/20180729/4410114.html Python常用的库简单介绍一下 fuzz ...
Python全部库整理
库名称简介 Chardet字符编码探测器,可以自动检测文本.网页.xml的编码. colorama主要用来给文本添加各种颜色,并且非常简单易用. Prettytable主要用于在终端或浏览器端构建格式 ...

随机推荐

php 正则表达式一.函数解析
php正则表达式官方手册参考....... 一.php中常用的正则表达式函数 1.preg_match与preg_match_all preg_match: 函数信息 preg_match_all: ...
http://www.nirsoft.net/about_nirsoft_freeware.html
http://www.nirsoft.net/about_nirsoft_freeware.html
解决：function in namespace ‘std’ does not name a type + allocator_/nullptr/dellocator_ was not declared + base operand of ‘->’ has non-pointer type ‘std::vector<cv::Mat>’ 错误编译时报错（caffe）
解决方法,用到了c++11,g++命令需要加上-std=c++11选项附:g++默认的c++标准 gcc-6.4.0 gcc-7.2.0 默认是 -std=gnu++14gcc-4.3.6 gcc- ...
spring 整合mybatis找不到${jdbc.driverClass}
1.检查是否设置了mapper扫描org.mybatis.spring.mapper.MapperScannerConfigurer类在spring里使用org.mybatis.spring.map ...
Linux学习笔记（9）linux网络管理与配置之一——Linux基础网络命令与学习大纲（0）
大纲目录 0.常用linux基础网络命令 1.配置主机名 2.配置网卡信息与IP地址 3.配置DNS客户端 4.配置名称解析顺序 5.配置路由与默认网关 6.双网卡绑定 [1] ping [2]net ...
简明python教程六----编写一个python脚本
备份程序: #!/usr/bin/python #Filename:backup_ver1.py import os import time source = ['/home/liuxj/python ...
Linux主从同步监测和利用sendMail来发邮件
首先介绍下sendMail About SendEmailSendEmail is a lightweight, command line SMTP email client. If you have ...
【转】XML的几种读写
XML文件是一种常用的文件格式,例如WinForm里面的app.config以及Web程序中的web.config文件,还有许多重要的场所都有它的身影.Xml是Internet环境中跨平台的,依赖于内 ...
day3-python的函数及参数
函数式编程最重要的是增强代码的重用性和可读性 1 2 3 4 def 函数名(参数): ... 函数体 ... 函数的定义主要有如下要点: def:表示函数的关键字函数名:函 ...
Team Foundation 中的错误和事件消息
Visual Studio Team System Team Foundation 中的错误和事件消息 Team Foundation 通过显示错误消息和事件消息来通知您操作成功以及操作失败.一部分错 ...

python urllib2库的简单总结

python urllib2库的简单总结的更多相关文章

随机推荐

热门专题