# Method 1:
import urllib.request

f = urllib.request.urlopen('http://www.baidu.com')
result = f.read().decode('utf-8')
print(result)

# Method 2:
import urllib.request

req = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(req)  # urlopen lives in urllib.request, not in urllib itself
result = response.read().decode('utf-8')
print(result)

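PS: If you have to use the urllib module, method 2 is recommended. req is a Request object, and request headers can be defined on it, so the request can be dressed up to look as if it came from a browser, as in the following example.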

3.3 Custom request headers

import urllib.request

req = urllib.request.Request('http://www.example.com')

# Add a custom header: the first argument is the header name (key), the second is its value
req.add_header("User-Agent", "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0")

f = urllib.request.urlopen(req)
result = f.read().decode('utf-8')
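
To check what a Request object will send before any network traffic happens, the headers can be inspected directly. The sketch below is only an illustration; the helper name build_request is an assumption and not part of urllib.

```python
import urllib.request


def build_request(url, user_agent):
    """Hypothetical helper: build a Request that carries a custom User-Agent."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", user_agent)
    return req


req = build_request(
    "http://www.example.com",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0",
)

# Nothing has been sent yet; these calls only read the prepared request.
print(req.get_method())      # HTTP method that urlopen would use
print(req.header_items())    # list of (name, value) pairs

# urllib stores header names capitalized, so look the header up as "User-agent".
print(req.get_header("User-agent"))
```

Note that urllib normalizes the stored header name to "User-agent", which is why the lookup uses that spelling.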

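# The fake_useragent module generates random User-Agent strings, which can help get past some sites' anti-crawler checks

3.4 Using fake_useragent

# 1. Install: pip install fake_useragent

# 2. Basic usage
from fake_useragent import UserAgent

ua = UserAgent()
print(ua.chrome)   # a Chrome User-Agent string

# Commonly used attributes
ua.chrome      # a Chrome User-Agent string
ua.ie          # a random Internet Explorer User-Agent string
ua.firefox     # a random Firefox User-Agent string
ua.random      # a random User-Agent string from any browser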
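
If fake_useragent is not available, the same idea can be roughly approximated with a small hand-maintained pool and random.choice. This is a swapped-in technique, not how fake_useragent works internally; the USER_AGENTS list and the random_headers helper are assumptions made up for this sketch.

```python
import random
import urllib.request

# Hypothetical, hand-maintained pool of User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def random_headers():
    """Return a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


# Build (but do not send) a request that uses the random header.
req = urllib.request.Request("http://www.example.com", headers=random_headers())
print(req.get_header("User-agent"))
```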

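4. Crawler request module: requests

4.1 Introduction to requests

Requests is an HTTP library written in Python and released under the Apache2 License. It is a high-level wrapper around Python's built-in modules that makes sending network requests far more pleasant, and it can easily perform most of the HTTP operations a browser can.

4.2 Installing requests

pip3 install requests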

4.3 Basic usage

import requests

r = requests.get('http://www.example.com')
print(type(r))
print(r.status_code)   # status code returned by the server
print(r.encoding)      # encoding used to decode the response
print(r.text)          # response body as a string

4.4 GET requests

# 1. Without parameters
import requests

res = requests.get('http://www.example.com')
print(res.url)     # the requested URL
print(res.text)    # the content returned by the server

# 2. With parameters
import requests

payload = {'k1': 'v1', 'k2': 'v2'}
res = requests.get('http://httpbin.org/get', params=payload)
print(res.url)
print(res.text)
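
To see how params ends up in the URL without sending anything, a request can be prepared locally. The sketch below uses requests.Request(...).prepare(); the list value for k2 is an illustrative assumption added to show how repeated keys are encoded.

```python
import requests

# A list value is expanded into one key=value pair per element.
payload = {'k1': 'v1', 'k2': ['a', 'b']}

# prepare() builds the final URL locally; no network request is made.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()
print(prepared.url)  # http://httpbin.org/get?k1=v1&k2=a&k2=b
```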

# 3. Parsing JSON
import requests
import json

response = requests.get('http://httpbin.org/get')
print(type(response.text))         # the body is a string
print(response.json())             # parse the JSON body into Python objects
print(json.loads(response.text))   # equivalent: parse the string with the json module
print(type(response.json()))       # dict
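
The difference between the raw text and the parsed result can also be seen offline. In this sketch the body string is a made-up stand-in for what a server might return; it is not real httpbin output.

```python
import json

# Hypothetical response body, standing in for response.text.
body = '{"args": {"k1": "v1"}, "url": "http://httpbin.org/get?k1=v1"}'

data = json.loads(body)      # what response.json() does with the body

print(type(body).__name__)   # str
print(type(data).__name__)   # dict
print(data["args"]["k1"])    # v1
```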

# 4. Adding headers
import requests
from fake_useragent import UserAgent

ua = UserAgent()

# Custom request headers
headers = {
    'User-Agent': ua.chrome
}

response = requests.get('http://www.zhihui.com', headers=headers)
print(response.text)

4.5 POST requests

# 1. Basic POST example
import requests

# Passing a dict as data sends it form-encoded (Content-Type: application/x-www-form-urlencoded):
payload = {'k1': 'v1', 'k2': 'v2'}
res = requests.post('http://httpbin.org/post', data=payload)
print(res.text)
print(type(res.headers), res.headers)
print(type(res.cookies), res.cookies)
print(type(res.url), res.url)
print(type(res.history), res.history)

# 2. Sending headers and data
import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

# When the Content-Type is application/json, serialize the payload to a JSON string first:
res = requests.post(url, data=json.dumps(payload), headers=headers)
print(res.text)
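
Recent versions of requests also accept a json= keyword that serializes the payload and sets the Content-Type header for you. The sketch below prepares both variants locally (nothing is sent) to suggest that they describe the same request; the variable names manual and shortcut are illustrative.

```python
import json
import requests

url = 'http://httpbin.org/post'
payload = {'some': 'data'}

# Variant 1: serialize by hand and set the header explicitly.
manual = requests.Request(
    'POST', url,
    data=json.dumps(payload),
    headers={'content-type': 'application/json'},
).prepare()

# Variant 2: let requests serialize the payload via json=.
shortcut = requests.Request('POST', url, json=payload).prepare()

print(manual.headers['Content-Type'], manual.body)
print(shortcut.headers['Content-Type'], shortcut.body)
```

The body may be a str in one case and bytes in the other, but both decode to the same JSON document.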

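4.6 Differences between GET and POST requests

In requests, a GET request normally passes its parameters only through params (encoded into the URL query string) and does not use data, while a POST request can use both: params for the query string and data for the request body.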
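
The distinction can be checked by preparing one request of each kind locally. This is a minimal sketch using requests.Request(...).prepare(), which builds the request without sending it; the names get_req and post_req are illustrative.

```python
import requests

payload = {'k1': 'v1', 'k2': 'v2'}

# GET: params are encoded into the URL, and the body stays empty.
get_req = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# POST: data is form-encoded into the body, and the URL is unchanged.
post_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()

print(get_req.url, get_req.body)
print(post_req.url, post_req.body)
print(post_req.headers['Content-Type'])
```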

4.7 HTTP status codes

# Informational.
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),

# Success.
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect', 'resume_incomplete', 'resume'),  # 'resume_incomplete' and 'resume' are to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication')
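
The names in the list above correspond to the lookup names exposed by requests.codes, so a status code can be compared by name instead of by number. A minimal sketch; the is_success helper is an assumption added for illustration.

```python
import requests

# Each name maps back to its numeric status code.
print(requests.codes.ok)                      # 200
print(requests.codes.not_found)               # 404
print(requests.codes['temporary_redirect'])   # 307


def is_success(status_code):
    """Hypothetical helper: True when the code equals 200 (requests.codes.ok)."""
    return status_code == requests.codes.ok


print(is_success(200), is_success(404))
```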

4.8 Getting cookies

# Session login
import requests

s = requests.Session()
# Set a cookie; httpbin expects /cookies/set/<name>/<value> (the name sessioncookie is just an example)
s.get('http://www.httpbin.org/cookies/set/sessioncookie/123456789')
res = s.get('http://www.httpbin.org/cookies')  # read the cookies back
print(res.text)   # print the cookies

httpbin.org sets cookies in the way shown above.

# Getting cookies
import requests

response = requests.get('http://www.baidu.com')
# print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)         # join as key=value
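
The cookie jar on a Session can also be filled and read without contacting a server, which is handy for trying out the iteration pattern. In this sketch the cookie name and value are made-up examples.

```python
import requests

s = requests.Session()

# Put a cookie into the session's jar by hand (example name and value).
s.cookies.set('sessioncookie', '123456789')

# Iterate the jar the same way as response.cookies.
for key, value in s.cookies.items():
    print(key + '=' + value)

# get_dict() returns the cookies as a plain dict.
print(s.cookies.get_dict())
```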

4.9 SSL settings

# Skip certificate verification
import requests
import urllib3

urllib3.disable_warnings()  # silence the warning emitted when verify=False
# verify only matters for https:// URLs
res = requests.get('https://www.12306.cn', verify=False)
print(res.status_code)

# Client certificate authentication
import requests

res = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(res.status_code)

4.10 Proxy settings

import requests

proxies = {
    "http": "http://127.0.0.1:9746",
    "https": "https://127.0.0.1:9924"
}
res = requests.get("http://www.taobao.com", proxies=proxies)
print(res.status_code)

# Proxy that requires a username and password
import requests

proxies = {
    "https": "https://user:password@127.0.0.1:9924"
}
# The "https" entry is only used for https:// URLs
res = requests.get("https://www.taobao.com", proxies=proxies)
print(res.status_code)
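
Which proxy entry applies depends on the scheme of the target URL. The sketch below uses select_proxy from requests.utils, a helper that requests uses internally, to show the lookup without sending anything; treat it as an illustration rather than a documented public API.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://127.0.0.1:9746",
    "https": "https://user:password@127.0.0.1:9924",
}

# The key is matched against the scheme of the URL being requested.
print(select_proxy("http://www.taobao.com", proxies))
print(select_proxy("https://www.taobao.com", proxies))

# With only an "https" entry, an http:// URL gets no proxy at all.
print(select_proxy("http://www.taobao.com", {"https": "https://127.0.0.1:9924"}))
```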

4.11 Timeouts and exception handling

import requests
from requests.exceptions import ReadTimeout

try:
    res = requests.get('http://httpbin.org/get', timeout=0.5)
except ReadTimeout:
    # Raised when the server does not send data within the timeout
    print('Timeout')
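
A timeout is only one of several ways a request can fail, so it can help to wrap the call in a small helper that also handles connection errors and other request exceptions. This is a sketch under assumptions: the safe_get name and the (3.05, 10) connect/read timeout pair are illustrative choices.

```python
import requests
from requests import exceptions


def safe_get(url, timeout=(3.05, 10)):
    """Hypothetical helper: return the Response, or None if the request fails."""
    try:
        # timeout is a (connect, read) pair in seconds
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()   # turn 4xx/5xx responses into exceptions
        return res
    except exceptions.Timeout:
        print('Timeout')
    except exceptions.ConnectionError:
        print('Connection failed')
    except exceptions.RequestException as e:
        # Base class of all requests exceptions, e.g. an invalid URL
        print('Request failed:', e)
    return None


# A URL without a scheme is rejected before any network traffic happens.
print(safe_get('not-a-valid-url'))
```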

4.12 Example: checking whether a QQ account is online

import urllib.request
import requests
from xml.etree import ElementTree as ET

# Send the HTTP request with the built-in urllib module
r = urllib.request.urlopen('http://www.webxml.com.cn/webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=3455306**')
result = r.read().decode('utf-8')

# Send the HTTP request with the third-party requests module
r = requests.get('http://www.webxml.com.cn/webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=3455306**')
result = r.text

# Parse the XML content
node = ET.XML(result)

# Read the result
if node.text == 'Y':
    print('online')
else:
    print('offline')
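
The XML parsing step can be tried on its own without calling the web service. In this sketch the sample string is an assumed shape for the response (a single element whose text is the status flag); the real service output may differ. The is_online helper is hypothetical.

```python
from xml.etree import ElementTree as ET


def is_online(xml_text):
    """Hypothetical helper: True when the root element's text is 'Y'."""
    node = ET.XML(xml_text)
    return node.text == 'Y'


# Assumed response shape, used here only for illustration.
sample = '<string xmlns="http://WebXml.com.cn/">Y</string>'

print('online' if is_online(sample) else 'offline')
```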

5. Crawler analysis module: re

5.1 How to use the re module

http://www.cnblogs.com/lisenlin/articles/8797892.html#1

5.2 A simple crawler example

import requests
import re
from fake_useragent import UserAgent

def get_page(url):
    """Fetch a page and return its HTML, or None on failure."""
    ua = UserAgent()
    headers = {
        'User-Agent': ua.chrome,
    }
    try:
        # Keep the request inside try so that network errors are caught as well
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except Exception as e:
        print(e)
        return None

def get_movie(html):
    """Extract one tuple of three captured fields per movie."""
    pattern = '<p.*?><a.*?>(.*?)</a></p>.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>'
    # re.S lets '.' match newlines, so one entry can span several lines
    items = re.findall(pattern, html, re.S)
    # print(items)
    return items

def write_file(items):
    """Write the extracted movies to movie.txt."""
    with open('movie.txt', 'w', encoding='utf8') as file_movie:
        for movie in items:
            file_movie.write('Movie: ' + movie[0] + '\r\n')
            file_movie.write('Starring: ' + movie[1].strip() + '\r\n')
            file_movie.write('Release date: ' + movie[2] + '\r\n\r\n')
    print('File written successfully...')

def main(url):
    html = get_page(url)
    if html is None:
        # Nothing to parse if the page could not be fetched
        return
    items = get_movie(html)
    write_file(items)

if __name__ == '__main__':
    url = "http://maoyan.com/board/4"
    main(url)
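
To see what the regular expression captures without fetching the live page, it can be run against a small HTML fragment. The fragment below is made up for illustration and only imitates the structure the pattern expects; the real page markup may differ.

```python
import re

# Made-up HTML imitating three consecutive <p> elements per movie.
html = '''
<dd>
  <p class="name"><a href="/films/1" title="Movie A">Movie A</a></p>
  <p class="star">
      Starring: Actor One, Actor Two
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
</dd>
<dd>
  <p class="name"><a href="/films/2" title="Movie B">Movie B</a></p>
  <p class="star">
      Starring: Actor Three
  </p>
  <p class="releasetime">Release date: 1994-09-10</p>
</dd>
'''

pattern = '<p.*?><a.*?>(.*?)</a></p>.*?<p.*?>(.*?)</p>.*?<p.*?>(.*?)</p>'

# re.S makes '.' match newlines, so the lazy '.*?' can cross line breaks.
items = re.findall(pattern, html, re.S)

for title, starring, released in items:
    # The middle field keeps its surrounding whitespace, hence strip().
    print(title, '|', starring.strip(), '|', released)
```

Each match is a tuple of the three captured groups, which is why write_file indexes movie[0], movie[1] and movie[2].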
