Learning Python Web Scraping ==> Chapter 7: Basic Usage of the urllib Library

Learning goal:

　　urllib provides functions for requesting and parsing URLs, so it is worth learning.

Steps

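Step 1: What is urllib

　　urllib is a package in Python's standard library and serves as Python's built-in HTTP request library, so nothing needs to be installed.

　　It contains four modules:

>>> import urllib

>>> # urllib.request      request module

>>> # urllib.error        exception handling module

>>> # urllib.parse        URL parsing module

>>> # urllib.robotparser  robots.txt parsing module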
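Since URL parsing is the stated reason for learning urllib, here is a minimal sketch of urllib.parse that runs without any network access. The query string and the page path are made up for illustration; only the httpbin.org host comes from this chapter.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urljoin

# Split a URL into its components (the query values are illustrative).
parts = urlparse('http://httpbin.org/get?word=hello&page=2#top')
print(parts.scheme, parts.netloc, parts.path, parts.fragment)

# Turn the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)

# Build a query string from a dict.
encoded = urlencode({'word': 'hello', 'page': 2})
print(encoded)

# Resolve a relative link against a base page.
absolute = urljoin('http://httpbin.org/docs/index.html', 'post')
print(absolute)
```

urlparse returns a named tuple, so the parts can be read by attribute name or unpacked by position.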

Step 2: Usage

urlopen

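# -*- coding: utf-8 -*-

import urllib.request

'''
The signature of urlopen is:

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

Note: cafile, capath and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13; pass an ssl.SSLContext through context instead.
'''

# Example 1: send a GET request and print the page

response = urllib.request.urlopen('http://www.baidu.com')

# read() returns the response body as bytes; decode('utf-8') turns it into a str.
# Without decode(), print() shows the bytes literal (b'...') on a single line with escaped newlines.
print(response.read().decode('utf-8'))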
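To see why the decode step matters, the following sketch compares printing bytes with printing the decoded text. The body here is a hypothetical stand-in for what read() would return, so it runs offline.

```python
# A hypothetical response body, encoded the way a server would send it.
raw = '<title>百度一下</title>\n<p>hello</p>\n'.encode('utf-8')

# Printing bytes shows a b'...' literal on one line, with escapes instead of line breaks.
print(raw)

# Printing the decoded str shows readable text with real line breaks.
text = raw.decode('utf-8')
print(text)

line_count = len(text.splitlines())
```

If the page is not UTF-8, decode('utf-8') raises UnicodeDecodeError, so the encoding passed to decode() has to match the one the server actually used.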

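print('\n')
print('urllib.parse example')
print('\n')

import urllib.request
import urllib.parse

# urlencode() builds the form string 'word=hello'; data must be bytes, so it is encoded as UTF-8.
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')

# Passing data makes urlopen send a POST request instead of a GET.
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())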
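The switch from GET to POST can be checked without sending anything. This is a small sketch that builds urllib.request.Request objects and inspects them; it assumes the same httpbin.org URLs used in this chapter and never opens a connection.

```python
import urllib.parse
import urllib.request

# Encode the form fields the same way as in the example above.
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

# Without data the request is a GET; with data it becomes a POST.
get_req = urllib.request.Request('http://httpbin.org/get')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)
```

A Request object can be passed to urlopen in place of the URL string, which is also the usual way to attach custom headers.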

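print('\n')
print('timeout usage in urllib and the urllib.error exception module')
print('\n')

import urllib.request
import socket
import urllib.error

try:
    # timeout is in seconds; 0.1 s is deliberately too short so the request fails.
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # e.reason holds the underlying exception; socket.timeout means the request timed out.
    # (socket.error is an alias of OSError and would also match unrelated network errors.)
    if isinstance(e.reason, socket.timeout):
        print('TIMEOUT')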
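The error handling above can be collected into one helper. The sketch below is one possible way to do it, not a standard API: describe_failure is a hypothetical name, and the exceptions are constructed by hand so the code runs offline. It also covers the case where the timeout fires while the response is being read, which can surface as a bare socket.timeout instead of a URLError.

```python
import socket
import urllib.error


def describe_failure(exc):
    """Return a short label for an exception raised while opening a URL (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return 'HTTP %d' % exc.code
    if isinstance(exc, urllib.error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return 'TIMEOUT'
        return 'URL ERROR: %s' % exc.reason
    # A timeout while reading the response may not be wrapped in URLError.
    if isinstance(exc, socket.timeout):
        return 'TIMEOUT'
    return 'OTHER: %r' % exc


wrapped = describe_failure(urllib.error.URLError(socket.timeout('timed out')))
direct = describe_failure(socket.timeout('timed out'))
refused = describe_failure(urllib.error.URLError(ConnectionRefusedError(111, 'Connection refused')))
print(wrapped, direct, refused, sep='\n')
```

In a real script the helper would be called from an except clause that catches both urllib.error.URLError and socket.timeout around urlopen.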

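Response

# -*- coding: utf-8 -*-

print("Response type example")

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

# For http and https URLs, urlopen returns an http.client.HTTPResponse object.
print(type(response))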

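Status code and response headers

# -*- coding: utf-8 -*-

print('Status code and response headers example')

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(response.status)                      # HTTP status code
print(response.getheaders())                # list of (name, value) tuples
print(response.getheader('Content-Type'))   # value of a single header
print(response.getheader('Date'))
print(response.getheader('Server'))

Output

Status code and response headers example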

200

[('Date', 'Tue, 03 Apr 2018 14:29:52 GMT'), ('Content-Type', 'text/html; charset=utf-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'Close'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'BAIDUID=6150350FD6AF7F0B4629DA49AEF7DEAE:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=6150350FD6AF7F0B4629DA49AEF7DEAE; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1522765792; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1430_25809_13290_21093_20927; path=/; domain=.baidu.com'), ('P3P', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Cache-Control', 'private'), ('Cxy_all', 'baidu+66a85a47dcb1b7de8cd2d7ba25b3a1dc'), ('Expires', 'Tue, 03 Apr 2018 14:29:42 GMT'), ('X-Powered-By', 'HPHP'), ('Server', 'BWS/1.1'), ('X-UA-Compatible', 'IE=Edge,chrome=1'), ('BDPAGETYPE', ''), ('BDQID', '0xa1de1b2000003abd'), ('BDUSERID', '')]

text/html; charset=utf-8

Tue, 03 Apr 2018 14:29:52 GMT

BWS/1.1
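The Content-Type header in the output above also carries the character set, which is what decode() needs. As a hedged sketch: response.headers is an http.client.HTTPMessage, a subclass of email.message.Message, so the same parsing methods can be tried offline on a hand-built Message. The header values below are copied from the output above; the fallback to 'utf-8' is an assumption.

```python
from email.message import Message
from http import HTTPStatus

# Stand-in for response.headers, filled with values from the output above.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'
headers['Server'] = 'BWS/1.1'

content_type = headers.get_content_type()           # media type without parameters
charset = headers.get_content_charset() or 'utf-8'  # fall back to utf-8 if none is declared
status_text = HTTPStatus(200).phrase                # human-readable name of a status code

print(content_type, charset, status_text)
```

With a live response, the same idea would read response.headers.get_content_charset() and pass the result to decode().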

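Handler: proxies

# -*- coding: utf-8 -*-

import urllib.request

# Map each URL scheme to the proxy that should handle it.
# Replace the addresses with those of your actual proxy.
proxy_handler = urllib.request.ProxyHandler(
    {'http': 'http://127.0.0.1:9743', 'https': 'https://127.0.0.1:9743'}
)

# build_opener returns an opener that routes requests through the handler.
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())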
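The proxy setup can be wrapped in a small function and inspected without a proxy actually running. make_proxy_opener is a hypothetical helper name, and the host and port are the placeholder values from the example above; nothing is sent over the network here.

```python
import urllib.request


def make_proxy_opener(host, port):
    """Build an opener that routes http and https traffic through one proxy (hypothetical helper)."""
    proxies = {
        'http': 'http://%s:%d' % (host, port),
        'https': 'https://%s:%d' % (host, port),
    }
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), handler


opener, handler = make_proxy_opener('127.0.0.1', 9743)

# The handler keeps the mapping it was given, and the opener holds the handler.
print(handler.proxies)
print(handler in opener.handlers)
```

If every later urlopen call should use the proxy, urllib.request.install_opener(opener) makes this opener the global default.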

Cookies (text data that records a user's identity)

# -*- coding: utf-8 -*-

import urllib.request, http.cookiejar

# Save the cookies to a file
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

# ignore_discard keeps session cookies; ignore_expires keeps cookies that have already expired.
cookie.save(ignore_discard=True, ignore_expires=True)

Printing cookies

# -*- coding: utf-8 -*-

import urllib.request, http.cookiejar

# Create a CookieJar object to hold the cookies
cookie = http.cookiejar.CookieJar()

# The handler stores cookies from responses and sends them back on later requests, as a browser would
handler = urllib.request.HTTPCookieProcessor(cookie)

# build_opener attaches the cookie handler to the opener
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

for i in cookie:
    print(i.name + '=' + i.value)

Loading locally saved cookies into a request

# -*- coding: utf-8 -*-

import urllib.request, http.cookiejar

# Use the same jar type that wrote the file, then load the saved cookies.
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)

handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)

# The loaded cookies are sent along with this request.
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
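The save and load steps can be tried as an offline round trip. In this sketch the cookie is built by hand instead of being set by a server; the name, value and domain are invented, and make_cookie is a hypothetical helper. The file goes into a temporary directory so nothing in the working directory is overwritten.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (illustration only; normally the server sets it)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

# Save: session cookies are only written when ignore_discard=True.
jar = http.cookiejar.LWPCookieJar(path)
jar.set_cookie(make_cookie('token', 'abc123', '.example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# Load: a fresh jar reads the same file back.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)
```

Leaving out ignore_discard=True on either side would drop this cookie, because it has no expiry time and is therefore treated as a session cookie.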

Summary:

　　The remaining built-in methods are not covered here; the next chapter moves straight on to the requests library.
