1. What is Urllib?
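
Urllib is Python's built-in HTTP request library. It bundles four modules: urllib.request (making requests), urllib.error (exception handling), urllib.parse (URL parsing), and urllib.robotparser (robots.txt parsing).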

2. Changes from Python 2
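
In Python 2 this functionality was split between the urllib and urllib2 modules; in Python 3 it all lives under the single urllib package, so urllib2.urlopen() becomes urllib.request.urlopen() and urllib.urlencode() becomes urllib.parse.urlencode().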

3. Usage

(1)urlopen

  urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
  # url: the address to request; data: the request body (supplying it makes the request a POST);
  # timeout: the timeout in seconds; the remaining parameters configure TLS and are rarely needed here
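
The trailing parameters (cafile, capath, cadefault, context) control HTTPS certificate verification. A minimal sketch, with an illustrative URL and timeout, that passes an explicit SSL context:

  import ssl
  import urllib.request

  # Build a default SSL context that verifies certificates against the system CA store
  context = ssl.create_default_context()
  response = urllib.request.urlopen("https://www.python.org", timeout=5, context=context)
  print(response.status)
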
  ######### GET request #############
  import urllib.request

  response = urllib.request.urlopen("http://www.baidu.com")
  print(response.read().decode("utf-8"))

The output is:

  <!DOCTYPE html>
  <!--STATUS OK-->
  ......
  <script>
  if(navigator.cookieEnabled){
      document.cookie="NOJS=;expires=Sat, 01 Jan 2000 00:00:00 GMT";
  }
  </script>
  </body>
  </html>

  ######### POST request #############
  import urllib.request
  import urllib.parse

  data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
  response = urllib.request.urlopen("http://httpbin.org/post", data=data)  # http://httpbin.org/post is an HTTP testing service
  print(response.read())

The output is:

  b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "word": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Content-Length": "10", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.5"\n }, \n "json": null, \n "origin": "221.208.253.76", \n "url": "http://httpbin.org/post"\n}\n'
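
Since httpbin.org echoes the request back as JSON, the response body can also be parsed rather than printed raw; a minimal sketch using the same endpoint and form data as above:

  import json
  import urllib.parse
  import urllib.request

  data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
  response = urllib.request.urlopen("http://httpbin.org/post", data=data)
  # Decode the bytes and parse the JSON document returned by httpbin
  result = json.loads(response.read().decode('utf-8'))
  print(result['form'])   # {'word': 'hello'}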

  ############### Setting a timeout ###############
  import urllib.request

  # If the server does not respond within the given time, an exception is raised
  response = urllib.request.urlopen("http://httpbin.org/get", timeout=1)
  print(response.read())

The output is:

  b'{\n "args": {}, \n "headers": {\n "Accept-Encoding": "identity", \n "Connection": "close", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.5"\n }, \n "origin": "221.208.253.76", \n "url": "http://httpbin.org/get"\n}\n'

  ############### Timeout that is too short for the server to respond ###############
  import urllib.request
  import urllib.error
  import socket

  try:
      response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
  except urllib.error.URLError as e:
      if isinstance(e.reason, socket.timeout):
          print("Time out")

The output is:

  Time out

(2) Response

Response type

  import urllib.request

  response = urllib.request.urlopen('https://www.python.org')
  print(type(response))

The output is:

  <class 'http.client.HTTPResponse'>

Status code and response headers

  import urllib.request

  response = urllib.request.urlopen('https://www.python.org')
  print(response.status)        # status code
  print(response.getheaders)    # response headers; note the missing parentheses, so this prints the bound method
  print(response.getheader('Server'))

The output is:

  200
  <bound method HTTPResponse.getheaders of <http.client.HTTPResponse object at 0x0000000002D04EB8>>
  nginx
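
Calling getheaders() with parentheses returns the actual header list rather than the bound method; a minimal sketch:

  import urllib.request

  response = urllib.request.urlopen('https://www.python.org')
  # getheaders() returns a list of (header-name, header-value) tuples
  for name, value in response.getheaders():
      print(name, ':', value)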

(3)request

  import urllib.request

  request = urllib.request.Request("https://python.org")
  response = urllib.request.urlopen(request)
  print(response.read().decode("utf-8"))

The output is:

  <!doctype html>
  <!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->
  <!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->
  <!--[if IE 8]> <html class="no-js ie8 lt-ie9"> <![endif]-->
  <!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr"> <!--<![endif]-->
  <head>
  ......
  </body>
  </html>

  ############ POST request with custom headers ###############
  from urllib import request, parse

  url = 'http://httpbin.org/post'
  headers = {
      "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)",
      "Host": 'httpbin.org'
  }
  dict = {
      'name': "Germey"
  }
  data = bytes(parse.urlencode(dict), encoding="utf-8")
  req = request.Request(url=url, data=data, headers=headers, method='POST')
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))

The output is:

  {
    "args": {},
    "data": "",
    "files": {},
    "form": {
      "name": "Germey"
    },
    "headers": {
      "Accept-Encoding": "identity",
      "Connection": "close",
      "Content-Length": "",
      "Content-Type": "application/x-www-form-urlencoded",
      "Host": "httpbin.org",
      "User-Agent": "Mozilla/4.0(compatible;MSIE 5.5;Windows NT)"
    },
    "json": null,
    "origin": "221.208.253.76",
    "url": "http://httpbin.org/post"
  }


  from urllib import request, parse

  url = "http://httpbin.org/post"
  dict = {
      'name': 'Germey'
  }
  data = bytes(parse.urlencode(dict), encoding='utf8')
  req = request.Request(url=url, data=data, method="POST")
  req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE5.5;Windows NT)')
  response = request.urlopen(req)
  print(response.read().decode('utf-8'))

The output is:

  {
    "args": {},
    "data": "",
    "files": {},
    "form": {
      "name": "Germey"
    },
    "headers": {
      "Accept-Encoding": "identity",
      "Connection": "close",
      "Content-Length": "",
      "Content-Type": "application/x-www-form-urlencoded",
      "Host": "httpbin.org",
      "User-Agent": "Mozilla/4.0(compatible;MSIE5.5;Windows NT)"
    },
    "json": null,
    "origin": "221.208.253.76",
    "url": "http://httpbin.org/post"
  }

(4)Handler

Proxy

  import urllib.request

  proxy_handler = urllib.request.ProxyHandler({
      'http': 'http://127.0.0.1:9743',    # proxy for http
      'https': 'https://127.0.0.1:9743'   # proxy for https
  })
  opener = urllib.request.build_opener(proxy_handler)
  response = opener.open("http://www.baidu.com")
  print(response.read())

Since I don't have a proxy actually running on 127.0.0.1:9743, the call fails with:

  urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it.>
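
If every subsequent urlopen() call should go through the proxy, the opener can be registered globally with install_opener() instead of calling opener.open() each time; a minimal sketch (the proxy address is again just a placeholder):

  import urllib.request

  proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})
  opener = urllib.request.build_opener(proxy_handler)
  urllib.request.install_opener(opener)   # from now on urlopen() uses this opener

  response = urllib.request.urlopen("http://www.baidu.com")
  print(response.status)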

Cookie

  import http.cookiejar, urllib.request

  cookie = http.cookiejar.CookieJar()                    # container that will hold the cookies
  handler = urllib.request.HTTPCookieProcessor(cookie)   # handler that stores cookies into the jar
  opener = urllib.request.build_opener(handler)          # build the opener
  response = opener.open("http://www.baidu.com")
  for item in cookie:
      print(item.name + "=" + item.value)

The output is:

  BAIDUID=DDCB4C216AE8EE90C7D95E7AF8FA577F:FG=1
  BIDUPSID=DDCB4C216AE8EE90C7D95E7AF8FA577F
  H_PS_PSSID=1452_21078_26350_27111
  PSTM=1536830732
  BDSVRTM=0
  BD_HOME=0
  delPer=0

  ########### Saving cookies to a file ##########
  import http.cookiejar, urllib.request

  filename = "cookie.txt"
  cookie = http.cookiejar.MozillaCookieJar(filename)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open("http://www.baidu.com")
  cookie.save(ignore_discard=True, ignore_expires=True)

After running this, a cookie.txt file appears in the project directory. Its contents are:

  # Netscape HTTP Cookie File
  # http://curl.haxx.se/rfc/cookie_spec.html
  # This is a generated file! Do not edit.

  .baidu.com TRUE / FALSE 3684314677 BAIDUID CB67C520D33E28D7204C570EB7DFA28F:FG=1
  .baidu.com TRUE / FALSE 3684314677 BIDUPSID CB67C520D33E28D7204C570EB7DFA28F
  .baidu.com TRUE / FALSE H_PS_PSSID 1434_21113_26350_20930
  .baidu.com TRUE / FALSE 3684314677 PSTM 1536831034
  www.baidu.com FALSE / FALSE BDSVRTM 0
  www.baidu.com FALSE / FALSE BD_HOME 0
  www.baidu.com FALSE / FALSE 2482910974 delPer 0

  ########### Another way to save cookies ##########
  import http.cookiejar, urllib.request

  filename = "cookies.txt"
  cookie = http.cookiejar.LWPCookieJar(filename)
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open("http://www.baidu.com")
  cookie.save(ignore_discard=True, ignore_expires=True)

Running this has the same effect as above, except the cookies are written in LWP format.
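
To reuse the saved cookies in a later run, load the file back into a cookie jar; a minimal sketch, assuming the cookies.txt written above (LWP format) sits in the current directory:

  import http.cookiejar, urllib.request

  cookie = http.cookiejar.LWPCookieJar()
  cookie.load("cookies.txt", ignore_discard=True, ignore_expires=True)   # read cookies back from the file
  handler = urllib.request.HTTPCookieProcessor(cookie)
  opener = urllib.request.build_opener(handler)
  response = opener.open("http://www.baidu.com")
  print(response.status)   # this request carries the loaded cookies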

(5) Exception handling

  from urllib import request, error

  try:
      response = request.urlopen("http://cuiqingcai.com/index.htm")
  except error.URLError as e:
      print(e.reason)

The output is:

  Not Found

  from urllib import request, error

  try:
      response = request.urlopen('http://cuiqingcai.com/index.htm')
  except error.HTTPError as e:
      print(e.reason, e.code, e.headers, sep='\n')
  except error.URLError as e:
      print(e.reason)
  else:
      print("Request Successfully")

The output is:

  Not Found
  404
  Server: nginx/1.10.3 (Ubuntu)
  Date: Thu, 13 Sep 2018 11:08:18 GMT
  Content-Type: text/html; charset=UTF-8
  Transfer-Encoding: chunked
  Connection: close
  Vary: Cookie
  Expires: Wed, 11 Jan 1984 05:00:00 GMT
  Cache-Control: no-cache, must-revalidate, max-age=0
  Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"
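
An HTTPError also behaves like the response it wraps, so the body of the error page can be read as well; a minimal sketch:

  from urllib import request, error

  try:
      response = request.urlopen('http://cuiqingcai.com/index.htm')
  except error.HTTPError as e:
      # e.read() returns the body of the error page (here the site's 404 page)
      print(e.code)
      print(e.read().decode('utf-8')[:200])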

  import socket
  import urllib.request
  import urllib.error

  try:
      response = urllib.request.urlopen("https://www.baidu.com", timeout=0.000000001)
  except urllib.error.URLError as e:
      print(type(e.reason))
      if isinstance(e.reason, socket.timeout):
          print("TimeOut")

The output is:

  <class 'socket.timeout'>
  TimeOut

(6) URL parsing

urlparse

  urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
  from urllib.parse import urlparse

  result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment")
  print(type(result), result)

The output is:

  <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i', fragment='comment')
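
ParseResult is a named tuple, so its fields can be read either by attribute or by index; a minimal sketch:

  from urllib.parse import urlparse

  result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment")
  print(result.scheme, result[0])   # http http
  print(result.netloc, result[1])   # www.baidu.com www.baidu.com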

  ######## URL without a scheme (the scheme argument supplies the default) ###########
  from urllib.parse import urlparse

  result = urlparse("www.baidu.com/index.html;user?id=5i#comment", scheme="https")
  print(result)

The output is:

  ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5i', fragment='comment')

  ######## URL that already has a scheme (the default scheme argument is ignored) ###########
  from urllib.parse import urlparse

  result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment", scheme="https")
  print(result)

The output is:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i', fragment='comment')

  from urllib.parse import urlparse

  result = urlparse("http://www.baidu.com/index.html;user?id=5i#comment", allow_fragments=False)
  print(result)

The output is:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5i#comment', fragment='')

  from urllib.parse import urlparse

  result = urlparse("http://www.baidu.com/index.html#comment", allow_fragments=False)
  print(result)

The output is:

  ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

urlunparse

  from urllib.parse import urlunparse

  data = ["http", "www.baidu.com", "index.html", "user", 'a=6', 'comment']
  print(urlunparse(data))

The output is:

  http://www.baidu.com/index.html;user?a=6#comment

urljoin (joins two URLs: the second argument takes priority, and the first acts as a base that only supplies the parts the second one is missing)

  from urllib.parse import urljoin

  print(urljoin('http://www.baidu.com', 'FAQ.html'))
  print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
  print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
  print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
  print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
  print(urljoin('http://www.baidu.com', '?category=2#comment'))
  print(urljoin('www.baidu.com', '?category=2#comment'))
  print(urljoin('www.baidu.com#comment', '?category=2'))

The output is:

  http://www.baidu.com/FAQ.html
  https://cuiqingcai.com/FAQ.html
  https://cuiqingcai.com/FAQ.html
  https://cuiqingcai.com/FAQ.html?question=2
  https://cuiqingcai.com/index.php
  http://www.baidu.com?category=2#comment
  www.baidu.com?category=2#comment
  www.baidu.com?category=2

urlencode (converts a dict into GET request parameters)

  from urllib.parse import urlencode

  params = {
      'name': 'germey',
      'agel': 22
  }
  base_url = 'http://www.baidu.com?'
  url = base_url + urlencode(params)
  print(url)

The output is:

  http://www.baidu.com?name=germey&agel=22
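
Going the other way, urllib.parse.parse_qs turns a query string back into a dict, and parse_qsl gives a list of tuples; a minimal sketch:

  from urllib.parse import parse_qs, parse_qsl

  query = 'name=germey&agel=22'
  print(parse_qs(query))    # {'name': ['germey'], 'agel': ['22']}
  print(parse_qsl(query))   # [('name', 'germey'), ('agel', '22')]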
