Web Scraping Study Notes: Basic Usage of the urllib Library (Day 1)

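I. The urllib Library in Detail

1. What is urllib

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: the request module. It opens a URL that you pass in, much as a browser would.

urllib.error: the exception-handling module. It defines the exceptions raised by urllib.request, so you can catch a failure and retry or fall back instead of letting the program abort unexpectedly.

urllib.parse: the URL-parsing module. A utility module with many functions for working with URLs, such as splitting and joining them.

urllib.robotparser: the robots.txt parsing module. It reads a site's robots.txt file and tells you which URLs may be crawled and which may not.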
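The notes below never come back to urllib.robotparser, so here is a minimal sketch of how it can be used. The robots.txt text and the example.com URLs are made up for illustration; a real crawler would download the file from the target site (for instance with set_url() and read()) instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt rules from a string and return the parser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```

can_fetch() takes the crawler's user-agent name first and the URL second, and returns True when the rules allow that agent to fetch the URL.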

2. Changes from Python 2

In Python 2 the request functions lived in urllib2:

    import urllib2
    response = urllib2.urlopen('http://www.baidu.com')

In Python 3 they moved to urllib.request:

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')

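3. Basic usage

urllib

urlopen

    urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Note: cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; on current versions, pass an ssl.SSLContext through context instead.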

Method 1: a plain GET request

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))  # read the response body (bytes) and decode it as UTF-8

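Method 2: a POST request with form data

    import urllib.request
    import urllib.parse
    # data must be bytes: URL-encode the dict, then encode the resulting string
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
    # passing data sends the request as POST; without it the request is a GET
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read())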
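To see what the data argument actually contains, and how its presence switches the method from GET to POST, the encoding step can be checked without sending anything. This is only a sketch: encode_form is a helper name invented here, and the Request class (covered further down in these notes) is used purely to inspect the method.

```python
import urllib.parse
import urllib.request


def encode_form(fields):
    """URL-encode a dict and return bytes suitable for the data argument."""
    return urllib.parse.urlencode(fields).encode("utf-8")


# Spaces become '+', and non-ASCII text is percent-encoded as UTF-8 bytes.
body = encode_form({"word": "hello world", "lang": "中文"})
print(body)

# No request is sent here; the objects are only built and inspected.
get_req = urllib.request.Request("http://httpbin.org/get")
post_req = urllib.request.Request("http://httpbin.org/post", data=body)
print(get_req.get_method(), post_req.get_method())
```

Passing a str instead of bytes as data raises a TypeError when the request is sent, which is why the encode step is needed.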

Method 3: setting a timeout (in seconds)

    import urllib.request
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
    print(response.read())

Method 4: catching the timeout

    import socket
    import urllib.request
    import urllib.error
    try:
        response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
    except urllib.error.URLError as e:
        # e.reason holds the underlying exception; check whether it is a timeout
        if isinstance(e.reason, socket.timeout):
            print('TIME OUT')
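Catching the timeout is usually the first half of the job; the second half is retrying. The following is a rough sketch of one way to do that, not a standard urllib feature. fetch_with_retry, its parameters, and flaky_opener are names invented for this example, and the stand-in opener exists only so the code can run offline.

```python
import io
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, timeout=1.0, delay=0.0,
                     opener=urllib.request.urlopen):
    """Return the response body, retrying when a request times out.

    retries is the total number of attempts and must be at least 1.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as e:
            # Only timeouts are retried; any other failure is re-raised at once.
            if not isinstance(e.reason, socket.timeout):
                raise
            last_error = e
        except socket.timeout as e:
            # A timeout while reading the body is raised directly, not wrapped.
            last_error = e
        if attempt < retries:
            time.sleep(delay)
    raise last_error


# Stand-in opener for an offline demo: it times out twice, then succeeds.
attempts = []


def flaky_opener(url, timeout):
    attempts.append(url)
    if len(attempts) < 3:
        raise urllib.error.URLError(socket.timeout("timed out"))
    return io.BytesIO(b"ok")


body = fetch_with_retry("http://httpbin.org/get", opener=flaky_opener)
print(body, len(attempts))
```

In real use you would drop the opener argument so that urllib.request.urlopen is called, and set delay to a small positive value to avoid hammering the server.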

Response

Response type

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')
    print(type(response))  # <class 'http.client.HTTPResponse'>

Status code and response headers

    import urllib.request
    response = urllib.request.urlopen('http://www.python.org')
    print(response.status)               # the HTTP status code
    print(response.getheaders())         # all response headers
    print(response.getheader('Server'))  # one specific header, here Server
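The examples above hard-code decode('utf-8'), which works for pages served as UTF-8 but not for every site. Since the response headers are available, one option is to read the charset from the Content-Type header. This is a sketch under assumptions: decode_body is a helper name made up here, and the data: URL is only a trick to get a response object without touching the network.

```python
import urllib.request


def decode_body(response, default="utf-8"):
    """Decode the body using the charset named in the Content-Type header."""
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset, errors="replace")


# A data: URL keeps the demo offline; with a real site, pass an http(s) URL.
with urllib.request.urlopen("data:text/plain;charset=utf-8,hello%20urllib") as response:
    text = decode_body(response)
print(text)
```

When the server does not declare a charset, get_content_charset() returns None and the helper falls back to the default.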

Request

Passing the URL to urlopen as a Request object

    import urllib.request
    request = urllib.request.Request('https://python.org')  # wrap the URL in a Request object
    response = urllib.request.urlopen(request)               # urlopen accepts the object just like a URL string
    print(response.read().decode('utf-8'))

Adding headers, form data, and a method to a Request

    from urllib import request, parse
    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
        'Host': 'httpbin.org'
    }
    # named form_data rather than dict, so the built-in dict type is not shadowed
    form_data = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(form_data), encoding='utf-8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))

The Request.add_header() method

    from urllib import request, parse
    url = 'http://httpbin.org/post'
    # named form_data rather than dict, so the built-in dict type is not shadowed
    form_data = {
        'name': 'Germey'
    }
    data = bytes(parse.urlencode(form_data), encoding='utf-8')
    req = request.Request(url=url, data=data, method='POST')
    # add headers one at a time after the Request has been created
    req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))
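Before sending a request it can help to inspect what the Request object holds. The sketch below builds the same kind of request and prints its method, body, and headers without sending it. One detail worth knowing: Request stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent', and get_header() looks the name up exactly as stored.

```python
from urllib import parse, request

data = parse.urlencode({"name": "Germey"}).encode("utf-8")
req = request.Request("http://httpbin.org/post", data=data, method="POST")
req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")

# Nothing is sent; these calls only read back what the object stores.
method = req.get_method()
stored_headers = dict(req.header_items())
user_agent = req.get_header("User-agent")

print(method)
print(req.data)
print(stored_headers)
print(user_agent)
```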

Handler

Proxies

    import urllib.request
    # Build two proxy handlers: one with a proxy address, one without
    httpproxy_handler = urllib.request.ProxyHandler({"http": "127.0.0.1:9743"})
    nullproxy_handler = urllib.request.ProxyHandler({})
    # A switch that turns the proxy on or off
    proxySwitch = True
    # urllib.request.build_opener() turns a handler into a custom opener object;
    # the switch decides which handler is used
    if proxySwitch:
        opener = urllib.request.build_opener(httpproxy_handler)
    else:
        opener = urllib.request.build_opener(nullproxy_handler)
    request = urllib.request.Request("http://www.baidu.com/")
    # Only opener.open() goes through the custom proxy; plain urlopen() does not
    response = opener.open(request)
    # install_opener() makes the opener global: from here on, urlopen() uses the custom proxy too
    urllib.request.install_opener(opener)
    # response = urllib.request.urlopen(request)
    print(response.read())

Building a proxy handler for both HTTP and HTTPS

    import urllib.request
    # Build the proxy handler from the chosen proxies
    proxy_handler = urllib.request.ProxyHandler({
        'http': 'http://127.0.0.1:9743',
        'https': 'https://127.0.0.1:9743'
    })
    opener = urllib.request.build_opener(proxy_handler)
    request = urllib.request.Request("http://www.baidu.com")
    response = opener.open(request)
    print(response.read())

Cookie

Cookies are a mechanism for maintaining login state across requests.

Getting cookies

    import http.cookiejar, urllib.request
    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    # the jar now holds the cookies set by the response
    for item in cookie:
        print(item.name + "=" + item.value)

Saving cookies to a text file

    import http.cookiejar, urllib.request
    filename = "cookie.txt"
    cookie = http.cookiejar.MozillaCookieJar(filename)  # MozillaCookieJar is a subclass of CookieJar
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)  # save() writes the cookies to the text file

Saving cookies in another format (method 2)

    import http.cookiejar, urllib.request
    filename = "cookie.txt"
    cookie = http.cookiejar.LWPCookieJar(filename)  # LWPCookieJar is another subclass of CookieJar
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)  # save() writes the cookies to the text file in the LWP format

Loading the cookies saved with method 2 (LWPCookieJar)

    import http.cookiejar, urllib.request
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    # The cookies were stored in a text file, loaded back, and sent with the request,
    # so the page returned is the one you would see while logged in
    print(response.read().decode('utf-8'))
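The save and load steps can be tried without visiting any site by putting a hand-built cookie in the jar. This is only a sketch: make_cookie is a helper invented here, the cookie name, value, and domain are made up, and in normal use HTTPCookieProcessor creates the Cookie objects from server responses.

```python
import http.cookiejar
import os
import tempfile
import time


def make_cookie(name, value, domain):
    """Build a Cookie by hand (normally done for you from Set-Cookie headers)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600,
        discard=False, comment=None, comment_url=None, rest={},
    )


# Write to a temporary directory so no file is left in the working directory.
path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

saved = http.cookiejar.LWPCookieJar(path)
saved.set_cookie(make_cookie("session", "abc123", "example.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# A fresh jar reads the same file back.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
pairs = {c.name: c.value for c in loaded}
print(pairs)
```

The file must be loaded with the same jar class that saved it, because MozillaCookieJar and LWPCookieJar use different file formats.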

URL parsing

    # urlparse: urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
    # Splits a URL into its components
    from urllib.parse import urlparse, urlunparse
    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
    print(type(result), result)
    # Output: <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

    # Specify a default scheme. Without a leading '//', the host is parsed as part of the path
    result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
    print(result)
    # Output: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

    # If the URL already carries a scheme, that scheme wins over the scheme argument
    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
    print(result)
    # Output: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

    # Anchors: with allow_fragments=False, '#comment' stays attached to the query
    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
    print(result)
    # Output: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

    # With no query in the URL, '#comment' stays attached to the path instead
    result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
    print(result)
    # Output: ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

    # urlunparse: assembles six URL components into a complete URL
    from urllib.parse import urlunparse
    data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
    print(urlunparse(data))
    # Output: http://www.baidu.com/index.html;user?a=6#comment

    # urljoin: components present in the second URL override those of the first
    from urllib.parse import urljoin
    print(urljoin('http://www.baidu.com/about.html', 'https://cuiqincai.com/FAQ.html'))
    # Output: https://cuiqincai.com/FAQ.html
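The example above joins two absolute URLs, so the second simply replaces the first. In a crawler urljoin is more often used to resolve relative links found in a page against the page's own URL. A short sketch, with a made-up base path and link values:

```python
from urllib.parse import urljoin

base = "http://www.baidu.com/docs/about.html"

resolved = {
    # A bare file name replaces the last path segment of the base.
    "faq.html": urljoin(base, "faq.html"),
    # A leading slash replaces the whole path.
    "/faq.html": urljoin(base, "/faq.html"),
    # A query-only reference keeps the path and sets the query string.
    "?page=2": urljoin(base, "?page=2"),
    # '..' climbs one directory before appending the rest.
    "../index.html": urljoin(base, "../index.html"),
}

for link, absolute in resolved.items():
    print(link, "->", absolute)
```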

    # urlencode: converts a dict into a query string
    from urllib.parse import urlencode
    params = {
        'name': 'germey',
        'age': 22
    }
    base_url = 'http://www.baidu.com?'
    url = base_url + urlencode(params)  # turn the dict into request parameters
    print(url)
    # Output: http://www.baidu.com?name=germey&age=22
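The parsing functions combine naturally: split a URL, change one part, and put it back together. The sketch below updates a single query parameter that way. set_query_param is a helper name invented for this example, and parse_qs (not used elsewhere in these notes) is the urllib.parse function that turns a query string back into a dict of lists.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def set_query_param(url, key, value):
    """Return url with the query parameter key set to value."""
    parts = urlparse(url)
    query = parse_qs(parts.query)      # e.g. {'id': ['5']}
    query[key] = [str(value)]
    # ParseResult is a named tuple, so _replace returns a modified copy.
    new_parts = parts._replace(query=urlencode(query, doseq=True))
    return urlunparse(new_parts)


original = "http://www.baidu.com/index.html;user?id=5#comment"
changed = set_query_param(original, "id", 6)
extended = set_query_param(original, "page", 2)
print(changed)
print(extended)
```

doseq=True tells urlencode to write each list element as its own key=value pair, which matches the dict-of-lists shape that parse_qs produces.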

Exception handling

    from urllib import request, error  # used by both examples below

    # Print the reason for the failure
    try:
        response = request.urlopen('http://wyh.com/index.html')
    except error.URLError as e:
        print(e.reason)  # print the reason so the program keeps running instead of crashing

    # Catch specific exception types
    try:
        response = request.urlopen('http://wyh.com/index.html')
    except error.HTTPError as e:  # HTTPError is the subclass, so it is caught first
        print(e.reason, e.code, e.headers, sep='\n')  # e.headers holds the response headers
    except error.URLError as e:   # URLError is the parent class
        print(e.reason)
    else:
        print('Request Successfully!')

    # Check the reason for the failure
    import socket
    import urllib.request
    import urllib.error
    try:
        response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
    except urllib.error.URLError as e:
        print(type(e.reason))  # the reason is itself an exception object; print its class
        if isinstance(e.reason, socket.timeout):  # isinstance() checks whether the reason is a timeout
            print('TIME OUT!')
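The three checks above (HTTP error, timeout, other connection failure) can be folded into one function, which is handy for logging. This is a sketch, not part of urllib: describe_error is a name made up here, and the sample exceptions are constructed by hand so that the code runs without a network connection.

```python
import socket
import urllib.error


def describe_error(exc):
    """Turn a urllib exception into a short, log-friendly message."""
    # HTTPError is a subclass of URLError, so it has to be tested first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return f"connection failed: {exc.reason}"
    return f"unexpected error: {exc!r}"


# Hand-built exceptions standing in for real failures.
samples = [
    urllib.error.HTTPError("http://example.com/missing", 404, "Not Found", None, None),
    urllib.error.URLError(socket.timeout("timed out")),
    urllib.error.URLError("name resolution failed"),
]
messages = [describe_error(e) for e in samples]
for message in messages:
    print(message)
```

In a crawler you would call describe_error(e) inside an except urllib.error.URLError as e block.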
