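Installing requests
- Connection exceptions in the requests library
- The HTTP protocol
  - HTTP operations on resources
The 7 main methods of the requests library
- The request method
  - Full usage of the request method
    - method: the request types (7 kinds)
- The get method
Problems caused by web crawlers
- The robots protocol
  - How to comply with the robots protocol
- Web crawler practice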

Installing requests

Installing requests for Python 2

python2 -m pip install requests

Installing requests for Python 3

python3 -m pip install requests

A small demo

>>> import requests
>>> r = requests.get("http://www.baidu.com")  # fetch the Baidu home page
>>> r.status_code  # check the status code; 200 means the request succeeded
200
>>> r.encoding = 'utf-8'  # set the encoding to UTF-8
>>> r.text  # show the page content

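Connection exceptions in the requests library

requests.ConnectionError Network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError HTTP error
requests.URLRequired A URL is missing
requests.TooManyRedirects The maximum number of redirects was exceeded
requests.ConnectTimeout Timed out while connecting to the remote server
requests.Timeout The request to the URL timed out (covers the whole request, not only the connection phase)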
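
The exceptions listed above form a class hierarchy (for instance, ConnectTimeout is both a ConnectionError and a Timeout), so the order in which they are checked matters. The following is a minimal sketch rather than part of the original notes: describe_error is a hypothetical helper name, and the exceptions are created by hand so that nothing is sent over the network.

```python
import requests


def describe_error(exc):
    """Return a short label for an exception, checking the most specific class first."""
    if isinstance(exc, requests.ConnectTimeout):
        return "connect timeout"
    if isinstance(exc, requests.Timeout):
        return "timeout"
    if isinstance(exc, requests.ConnectionError):
        return "connection error"
    if isinstance(exc, requests.HTTPError):
        return "http error"
    if isinstance(exc, requests.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, requests.RequestException):
        return "other requests error"
    return "not a requests error"


# Hand-made exception instances stand in for real failures.
samples = [
    requests.ConnectTimeout(),
    requests.Timeout(),
    requests.HTTPError(),
    ValueError(),
]
labels = [describe_error(e) for e in samples]
print(labels)
```

In real code the same checks would usually be written as separate except clauses, again with the most specific exception first and requests.RequestException last.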

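A general-purpose code framework, as a small example

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise requests.HTTPError for 4xx/5xx status codes
        print(r.apparent_encoding)
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # base class of the requests exceptions: connection errors, timeouts, HTTP errors
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))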

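The HTTP protocol

HTTP stands for Hypertext Transfer Protocol. It is a stateless application-layer protocol based on a request/response model. HTTP uses URLs to identify and locate network resources. The URL format is:

http://host[:port][path]

host: a valid Internet host name or IP address

port: the port number; the default is 80

path: the path of the requested resource

Example HTTP URLs:

http://www.bit.edu.cn

http://220.181.111.188/duty

How to read an HTTP URL:

A URL is an Internet path for accessing a resource over HTTP; each URL corresponds to one data resource.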
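
The host, port, and path parts of the two example URLs can be pulled apart with the standard library. This is a small sketch for illustration only; split_http_url is a hypothetical helper name, and falling back to port 80 and path "/" is this sketch's way of applying the defaults described above.

```python
from urllib.parse import urlsplit


def split_http_url(url):
    """Return (host, port, path) for an http:// URL, filling in the defaults."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else 80  # default HTTP port
    path = parts.path or "/"  # an empty path means the root of the site
    return parts.hostname, port, path


examples = [
    split_http_url("http://www.bit.edu.cn"),
    split_http_url("http://220.181.111.188/duty"),
]
print(examples)
```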

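HTTP operations on resources

GET Request the resource at the URL
HEAD Request the response headers for the resource at the URL, without the body
POST Append new data to the resource at the URL
PUT Store a resource at the URL, overwriting the resource already there
PATCH Partially update the resource at the URL, changing only part of its content
DELETE Delete the resource stored at the URL

The HTTP methods correspond one-to-one to the methods of the requests library.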

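The 7 main methods of the requests library

requests.request() Constructs a request; the base method underlying all the methods below

requests.get() The main method for fetching an HTML page; corresponds to HTTP GET

requests.head() Fetches the header information of an HTML page; corresponds to HTTP HEAD

requests.post() Submits a POST request to an HTML page; corresponds to HTTP POST

requests.put() Submits a PUT request to an HTML page; corresponds to HTTP PUT

requests.patch() Submits a partial-update request to an HTML page; corresponds to HTTP PATCH

requests.delete() Submits a delete request to an HTML page; corresponds to HTTP DELETE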

head() example

>>> r = requests.head('http://httpbin.org/get')
>>> r.headers
{'Content-Length': '238', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Content-Type': 'application/json', 'Server': 'nginx', 'Connection': 'keep-alive', 'Date': 'Sat, 18 Feb 2017 12:07:44 GMT'}
>>> r.text
''

post() example

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post('http://httpbin.org/post', data=payload)
>>> print(r.text)
{ ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
}

When a dictionary is POSTed to a URL, it is automatically form-encoded and ends up in the form field.

>>> r = requests.post('http://httpbin.org/post', data='ABC')
>>> print(r.text)
{ ...
  "data": "ABC",
  "form": {},
}

When a string is POSTed to a URL, it is sent as-is and ends up in the data field.
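
The difference between posting a dictionary and posting a string can be inspected without contacting the server by preparing the requests instead of sending them. This is a sketch, not part of the original notes; it relies on requests.Request(...).prepare(), which builds the outgoing request locally.

```python
import requests

URL = "http://httpbin.org/post"  # only used to build the requests; nothing is sent

# A dictionary is form-encoded and gets a form Content-Type header.
form_req = requests.Request(
    "POST", URL, data={"key1": "value1", "key2": "value2"}
).prepare()

# A string is used as the body unchanged, with no Content-Type added.
raw_req = requests.Request("POST", URL, data="ABC").prepare()

print(form_req.body, form_req.headers.get("Content-Type"))
print(raw_req.body, raw_req.headers.get("Content-Type"))
```

This matches what httpbin reports: the dictionary shows up under "form", the string under "data".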

put() example

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.put('http://httpbin.org/put', data=payload)
>>> print(r.text)
{ ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
}

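The request method

The request method of the requests library is the base method for all the other methods.

Full usage of the request method

requests.request(method, url, **kwargs)

method : the request method, one of 7 such as get/put/post
url : the URL of the page to fetch
**kwargs: parameters that control the request, 13 in total

method: the request types (7 kinds)

r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)

These correspond to the request methods of the HTTP protocol.

OPTIONS asks the server for the parameters that the server and the client can use to communicate.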
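
As a quick offline check of the seven request types, the sketch below prepares one request per method and lists the matching helper functions. It is an illustration only, not part of the original notes; the httpbin URL is just a placeholder and no request is sent. Note that requests upper-cases the method name while preparing the request.

```python
import requests

URL = "http://httpbin.org/get"  # placeholder; the requests are prepared, never sent

# The method string is normalized to upper case during preparation.
methods = ["GET", "HEAD", "POST", "PUT", "PATCH", "delete", "OPTIONS"]
prepared_methods = [requests.Request(m, URL).prepare().method for m in methods]
print(prepared_methods)

# Each HTTP method also has a convenience function of the same name.
helpers = [
    requests.get, requests.head, requests.post, requests.put,
    requests.patch, requests.delete, requests.options,
]
helper_names = [h.__name__ for h in helpers]
print(helper_names)
```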

**kwargs: parameters that control the request; all are optional

params : dictionary or byte sequence, added to the URL as query parameters

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('GET', 'http://python123.io/ws', params=kv)
>>> print(r.url)
http://python123.io/ws?key1=value1&key2=value2

data : dictionary, byte sequence, or file object, used as the body of the Request

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request('POST', 'http://python123.io/ws', data=kv)
>>> body = 'body content'
>>> r = requests.request('POST', 'http://python123.io/ws', data=body)

json : data in JSON format, used as the body of the Request

>>> kv = {'key1': 'value1'}
>>> r = requests.request('POST', 'http://python123.io/ws', json=kv)

headers : dictionary of custom HTTP headers

>>> hd = {'user-agent': 'Chrome/10'}
>>> r = requests.request('POST', 'http://python123.io/ws', headers=hd)

cookies : dictionary or CookieJar, the cookies in the Request

auth : tuple, enables HTTP authentication

files : dictionary, for uploading files

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request('POST', 'http://python123.io/ws', files=fs)

timeout : the timeout, in seconds

>>> r = requests.request('GET', 'http://www.baidu.com', timeout=10)

proxies : dictionary, sets the proxy servers to use; login credentials can be included

>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234',
...        'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

allow_redirects : True/False, default True; switch for following redirects

stream : True/False, default False; when False, the response body is downloaded immediately

verify : True/False, default True; switch for verifying the SSL certificate

cert : path to a local SSL certificate
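
Several of these parameters can be combined in one call. The sketch below is an illustration, not part of the original notes: it prepares a request locally with params, headers, and json to show where each one ends up. Parameters such as timeout, proxies, verify, and cert only take effect when the request is actually sent, so they do not appear in the prepared request.

```python
import json

import requests

req = requests.Request(
    "POST",
    "http://python123.io/ws",          # URL taken from the examples above; nothing is sent
    params={"key1": "value1"},         # becomes the query string
    headers={"user-agent": "Chrome/10"},
    json={"key2": "value2"},           # becomes a JSON body
)
prepared = req.prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])    # header lookup is case-insensitive
print(prepared.headers["Content-Type"])
print(json.loads(prepared.body))
```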

The get method

Common usage of the get method

r = requests.get(url)

r is the returned Response object, which contains the server's resource

The get method builds a Request object that asks the server for the resource

Full usage of the get method

requests.get(url, params=None, **kwargs)

url : the URL of the page to fetch
params : extra parameters for the URL, as a dictionary or byte stream; optional
**kwargs: 12 parameters that control the request; optional

>>> import requests
>>> r = requests.get("http://www.baidu.com")  # fetch the Baidu home page
>>> print(r.status_code)  # print the status code of the request
200
>>> type(r)  # r is an instance of the Response class
<class 'requests.models.Response'>
>>> r.headers  # the response headers returned for the GET request
{'Server': 'bfe/1.0.8.18', 'Date': 'Wed, 19 Apr 2017 09:28:11 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:33 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'}

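A status code of 200 means the request succeeded; a 4xx or 5xx status code means it failed.

The Response object contains all the information returned by the server, as well as the information of the Request that produced it.

Attributes of the Response object

r.status_code The HTTP status of the response; 200 means success, 404 means failure
r.text The response body as a string, that is, the content of the page at the URL
r.encoding The encoding of the response body, guessed from the HTTP headers
r.apparent_encoding The encoding of the response body, inferred from the content itself (a fallback encoding)
r.content The response body as bytes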
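
How status codes map to success or failure can be tried out offline. The sketch below is not part of the original notes: it builds empty Response objects by hand and sets status_code directly, purely for demonstration; status_summary is a hypothetical helper name.

```python
import requests


def status_summary(code):
    """Return (r.ok, whether raise_for_status() raised) for a given status code."""
    r = requests.Response()  # an empty Response built by hand, no network involved
    r.status_code = code
    try:
        r.raise_for_status()
        raised = False
    except requests.HTTPError:
        raised = True
    return r.ok, raised


summary = {code: status_summary(code) for code in (200, 301, 404, 500)}
print(summary)
```

raise_for_status() only raises for 4xx and 5xx codes, which is why it is called right after the request in the code framework earlier in these notes.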

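Response encoding

r.encoding: if the header contains no charset, the encoding is assumed to be ISO-8859-1

r.text decodes the page content according to r.encoding

r.apparent_encoding: the encoding inferred from the page content; it can be seen as a fallback for r.encoding

r.apparent_encoding is more reliable than r.encoding when the header does not declare a charset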
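
Why the encoding matters can be shown with plain bytes, without any request. This sketch is an illustration only: the sample string is made up, and the two decode calls stand in for what r.text does with r.content under two different values of r.encoding.

```python
# Bytes as a server might send them for a UTF-8 encoded Chinese page (sample text).
body = "百度一下，你就知道".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but yields unreadable text.
as_latin1 = body.decode("ISO-8859-1")

# Decoding with the real encoding recovers the original text.
as_utf8 = body.decode("utf-8")

print(repr(as_latin1))
print(as_utf8)
```

This is the situation in which setting r.encoding = r.apparent_encoding before reading r.text helps.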

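Problems caused by web crawlers

Web crawlers come in three scales:

Crawling web pages, playing with web pages
Small scale, small amount of data, not sensitive to crawl speed. Use the requests library.

Crawling websites or series of websites
Medium scale, larger amount of data, sensitive to crawl speed, for example crawling Ctrip. Use the Scrapy library.

Crawling the entire web
Large scale; for search engines, crawl speed is critical. Only custom development will do.

Crawlers cause three kinds of problems:

Harassing servers. Web servers are designed to serve human visitors by default. Depending on how well and for what purpose it is written, a web crawler can impose a huge resource overhead on a web server.

Legal risk over ownership. Data on a server has an owner. Profiting from data obtained by a crawler carries legal risk.

Privacy leaks. A web crawler may be able to get past simple access controls and obtain protected data, leaking personal information.

How servers restrict web crawlers

Source review: restrict by User-Agent (technically demanding)
Check the User-Agent field of incoming HTTP request headers and respond only to browsers or friendly crawlers.

Public notice: the Robots protocol
Tell all crawlers the site's crawling policy and ask them to comply.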
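
Because a server may inspect the User-Agent header, it helps to see what requests sends by default and how a session can override it. This is a sketch for illustration, not part of the original notes; the request is only prepared, never sent, and the URL is a placeholder.

```python
import requests

# The default User-Agent identifies the client as python-requests/<version>.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A session-level header replaces the default for every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
prepared = session.prepare_request(requests.Request("GET", "https://www.amazon.cn/"))
print(prepared.headers["User-Agent"])
```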

The robots protocol

Robots Exclusion Standard

Purpose:

A website tells web crawlers which pages may be crawled and which may not.

Form:

A robots.txt file in the root directory of the website.

For example:

JD.com's rules

https://www.jd.com/robots.txt

User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /

Basic syntax of the Robots protocol:

User-agent: *
Disallow: /

Note: * stands for all crawlers, / stands for the root directory

robots.txt files of some other websites

http://www.baidu.com/robots.txt

http://news.sina.com.cn/robots.txt

http://www.qq.com/robots.txt

http://news.qq.com/robots.txt

http://www.moe.edu.cn/robots.txt (no robots protocol)

Not every website has a robots.txt file.
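
Rules like these can be checked automatically with the standard library's urllib.robotparser. The sketch below is an illustration, not part of the original notes: it parses a copy of the JD.com rules quoted above from a string instead of downloading them, and "MyCrawler" is a hypothetical crawler name. As far as I know, this parser matches paths by prefix and does not interpret the * wildcard inside a path, so the sketch only checks rules that do not depend on wildcards.

```python
from urllib.robotparser import RobotFileParser

# A copy of part of the rules quoted above, parsed locally (no download).
ROBOTS_TXT = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*

User-agent: EtaoSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# EtaoSpider is banned from the whole site.
etao_allowed = parser.can_fetch("EtaoSpider", "https://www.jd.com/index.html")

# A crawler with no entry of its own falls under the "*" rules.
other_allowed = parser.can_fetch("MyCrawler", "https://www.jd.com/index.html")

print(etao_allowed, other_allowed)
```

Against a live site, set_url() followed by read() would fetch the real robots.txt before calling can_fetch().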

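How to comply with the robots protocol

How should the Robots protocol be followed in practice?

For web crawlers:

Read robots.txt, automatically or by hand, before crawling any content.

Binding force:

The Robots protocol is advisory, not binding. A web crawler may ignore it, but doing so carries legal risk.

Human-like access does not need to consult the robots protocol.

If the number of requests is small and the amount of data is small, the protocol need not be followed.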

Web crawler practice

Crawling a JD.com product page

import requests

url = "https://item.jd.com/896813.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")

Crawling an Amazon product page

import requests

url = "https://www.amazon.cn/电脑-it-办公/dp/B00D0393AM/ref=sr_1_4?s=pc&ie=UTF8&qid=1492660788&sr=1-4&keywords=移动硬盘"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # present the request as coming from a browser
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Crawl failed")

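Submitting a search keyword to Baidu/360

import requests

keyword = "python"
try:
    kv = {'wd': keyword}
    r = requests.get("http://www.baidu.com/s", params=kv)
    # For 360 search the keyword parameter is named 'q' rather than 'wd':
    # r = requests.get("http://www.so.com/s", params={'q': keyword})
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("Crawl failed")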
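
The URL that such a keyword search produces can be checked without sending anything by preparing the request locally. This is a sketch, not part of the original notes; the Chinese keyword is a made-up example that shows how non-ASCII text is percent-encoded as UTF-8.

```python
from urllib.parse import quote

import requests

BASE = "http://www.baidu.com/s"  # the request is prepared locally, not sent

ascii_url = requests.Request("GET", BASE, params={"wd": "python"}).prepare().url
chinese_url = requests.Request("GET", BASE, params={"wd": "爬虫"}).prepare().url

print(ascii_url)
print(chinese_url)
```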

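Crawling and saving an image from the web

import requests
import os

root = "D://pics//"
url = "http://image.nationalgeographic.com.cn/2017/0419/20170419035805561.jpg"
path = root + url.split('/')[-1]  # reuse the file name from the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:
            f.write(r.content)  # the with block closes the file automatically
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Crawl failed")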
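
The file-handling half of this example can be separated from the download and tried on its own. The sketch below is an illustration only: save_bytes is a hypothetical helper, the bytes are fake stand-ins for r.content, and a temporary directory replaces D://pics// so the code runs on any system.

```python
import os
import tempfile

URL = "http://image.nationalgeographic.com.cn/2017/0419/20170419035805561.jpg"


def save_bytes(content, url, root):
    """Save content under root, named after the last URL segment.

    Returns (path, saved); saved is False if the file already existed.
    """
    os.makedirs(root, exist_ok=True)  # create the directory if it is missing
    path = os.path.join(root, url.split("/")[-1])
    if os.path.exists(path):
        return path, False
    with open(path, "wb") as f:
        f.write(content)
    return path, True


with tempfile.TemporaryDirectory() as tmp:
    pics = os.path.join(tmp, "pics")
    first = save_bytes(b"fake image bytes", URL, pics)   # stands in for r.content
    second = save_bytes(b"fake image bytes", URL, pics)  # same file again
    results = (os.path.basename(first[0]), first[1], second[1])

print(results)
```

In a real crawl, content would be requests.get(url).content after a successful raise_for_status().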

Automatically looking up the location of an IP address

import requests

url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url + '202.204.80.112')
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])
except requests.RequestException:
    print("Crawl failed")

Reference:

http://www.icourse163.org/course/BIT-1001870001
