title: Crawler Basics Part 1, Fundamentals and requests

date: 2020-03-05 14:43:00

categories: python

tags: crawler

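An overview of web crawling and the basic concepts behind it.

Learning the requests library.

1. requests

Requests is an HTTP library written in Python. It is built on urllib3 and released under the Apache 2.0 open-source license.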

http://docs.python-requests.org/en/latest/

1.1

import requests

r = requests.get("http://www.whu.edu.cn/")   # returns a Response object

print(r.status_code)

A status code of 200 means the request succeeded.

Evaluating r.text gives the content of the page.

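HTTP status codes

200  OK: the request succeeded

404  Not Found: the server has no resource at that URL

503  Service Unavailable: the server cannot handle the request right now

…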
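The standard library already knows the reason phrase for each status code, which is handy when logging r.status_code. A minimal sketch using http.HTTPStatus; the describe() helper is a name made up for this example.

```python
from http import HTTPStatus


def describe(code: int) -> str:
    """Return 'code phrase' for a known HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


for code in (200, 404, 503):
    print(describe(code))
```

HTTPStatus(code) raises ValueError for a number that is not a registered status code, so a real crawler would want to guard that call.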

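1.2 HTTP headers

https://www.jianshu.com/p/6f29fcf1a6b3

HTTP (HyperText Transfer Protocol) is the protocol generally used to transfer web pages today. It follows a request/response model: a browser or other client sends a request, and the server answers with a response. Every HTTP message consists of two parts, the message header and the message body.

Following the way Wikipedia organizes them, header fields fall broadly into two groups: request headers and response headers.

A header can carry a charset (the character set, i.e. the encoding of the body).

r.encoding is the encoding that requests infers from the HTTP headers. If the headers contain no charset, it falls back to 'ISO-8859-1', which cannot represent Chinese characters.

r.apparent_encoding is the encoding that requests detects by analyzing the page content itself.

For the page fetched above, evaluating r.encoding shows 'ISO-8859-1'.

Evaluating r.apparent_encoding shows 'utf-8'.

Next, run r.encoding = r.apparent_encoding.

Evaluating r.text again now shows readable text instead of garbled characters.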
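Why the wrong r.encoding produces garbage can be reproduced without any network access. The sketch below uses only built-in str/bytes methods, and the sample text is just an illustration: it decodes the same UTF-8 bytes once with the ISO-8859-1 fallback and once with the real encoding.

```python
# Bytes as a server would send them for a UTF-8 page.
raw = "武汉大学".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but every byte
# becomes its own character, so the result is unreadable.
wrong = raw.decode("iso-8859-1")

# Decoding with the real encoding recovers the text.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
print(len(raw), len(wrong), len(right))
```

Setting r.encoding = r.apparent_encoding has the same effect as switching from the first decode to the second: the bytes in r.content stay the same, only the way r.text interprets them changes.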

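1.3 Exceptions

On a network problem (for example a failed DNS lookup or a refused connection), Requests raises a ConnectionError exception.

On the rare invalid HTTP response, Requests raises an HTTPError exception. r.raise_for_status() raises the same exception when the status code is a 4xx or 5xx error.

If the request times out, a Timeout exception is raised.

If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.

All exceptions that Requests raises explicitly inherit from requests.exceptions.RequestException.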
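Because these exceptions share one base class, a handler can test for the specific ones first and fall back to RequestException. A small offline sketch, assuming the requests package is installed; classify() and its labels are made up for this example and are not part of requests.

```python
import requests


def classify(exc: Exception) -> str:
    """Map an exception to a short label (hypothetical helper)."""
    # Timeout is checked first: ConnectTimeout is both a Timeout
    # and a ConnectionError.
    if isinstance(exc, requests.exceptions.Timeout):
        return "timeout"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection"
    if isinstance(exc, requests.exceptions.HTTPError):
        return "http"
    if isinstance(exc, requests.exceptions.TooManyRedirects):
        return "redirects"
    if isinstance(exc, requests.exceptions.RequestException):
        return "other requests error"
    return "not a requests error"


print(classify(requests.exceptions.ConnectTimeout()))
print(classify(requests.exceptions.HTTPError("503 Server Error")))
print(classify(ValueError("unrelated")))
```

The same ordering applies to except clauses: list the narrow exception types before the broad RequestException.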

1.4 A general-purpose template

Things to note:

try

except

r.raise_for_status()

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status
        # print("%d\n %s" % (r.status_code, r.text))
        print("%s %s" % (r.encoding, r.apparent_encoding))
        r.encoding = r.apparent_encoding
        print("%s %s" % (r.encoding, r.apparent_encoding))
        # html = r.content  # bytes
        # html_doc = str(html, 'utf-8')  # or: html_doc = html.decode("utf-8", "ignore")
        # print(html_doc)
        return r.text
    except requests.RequestException:
        # covers connection errors, timeouts, and the HTTPError from raise_for_status()
        return "request failed"

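1.5 requests methods (the HTTP operations)

Note the difference between a method and a function: GET, HEAD, POST and PUT are HTTP methods, while requests.get(), requests.head(), requests.post() and requests.put() are the Python functions that send them.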

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)  # returns a Response; note the timeout parameter
        r.raise_for_status()  # raises HTTPError for a 4xx or 5xx status
        # print("%d\n %s" % (r.status_code, r.text))
        print("%s %s" % (r.encoding, r.apparent_encoding))
        r.encoding = r.apparent_encoding
        print("%s %s" % (r.encoding, r.apparent_encoding))
        # html = r.content  # bytes
        # html_doc = str(html, 'utf-8')  # or: html_doc = html.decode("utf-8", "ignore")
        # print(html_doc)
        print(r.text)
        return r.text
    except requests.RequestException:
        return "request failed"

def head(url):
    r = requests.head(url)
    print(r.headers)    # note: the function is head(), the attribute is headers
    print(r.text)       # empty: a HEAD response has no body

def post(url):  # POST: append data to the resource
    r = requests.get("http://httpbin.org/post")  # plain GET first, for comparison
    print(r.text)
    payload = {'name': 'your_name', 'ID': 'your_student number'}
    r = requests.post("http://httpbin.org/post", data=payload)   # note the data parameter
    print(r.text)

def put(url):   # PUT: overwrite the resource
    r = requests.get("http://httpbin.org/put")  # plain GET first, for comparison
    print(r.text)
    payload = {'name': 'your_name', 'ID': '123456'}
    r = requests.put("http://httpbin.org/put", data=payload)
    print(r.text)

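1.6 Request control parameters: requests.request(method, url, **kwargs)

General form: requests.request(method, url, **kwargs)

**kwargs: parameters that control the request. All are optional; there are 13 in total.

params: dict or byte sequence, appended to the URL as the query string

data: dict, byte sequence, or file object, sent as the body of the request

json: JSON-serializable data, sent as the body of the request

headers: dict of custom HTTP headers; can be used to imitate any browser when sending a request

           hd = {'user-agent': 'Chrome/56.0'}

           r = requests.request('post', 'https://www.amazon.com/', headers=hd)

cookies: dict or CookieJar, the cookies sent with the request

auth: tuple, enables HTTP authentication

files: dict, used to upload files

timeout: timeout in seconds

proxies: dict, sets the proxy servers to go through; login credentials can be included

allow_redirects: True/False, default True; whether redirects are followed

stream: True/False, default False; when False, the response body is downloaded immediately

verify: True/False, default True; whether the SSL certificate is verified

cert: path to a local SSL certificate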
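To see what "appended to the URL" means for params, the query string can be built by hand with the standard library. This is only a rough sketch of the idea, not how requests implements it; with_params() is a hypothetical helper, and the Baidu URL and wd key are the ones used in the search example later in these notes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def with_params(url: str, params: dict) -> str:
    """Append params to url as a percent-encoded query string (hypothetical helper)."""
    # Use "&" when the URL already has a query string, "?" otherwise.
    separator = "&" if urlsplit(url).query else "?"
    return url + separator + urlencode(params)


full_url = with_params("http://www.baidu.com/s", {"wd": "知乎"})
print(full_url)

# Parsing the query string back shows that the keyword survives the round trip.
print(parse_qs(urlsplit(full_url).query))
```

Non-ASCII values are percent-encoded as UTF-8 bytes, which is why the keyword looks unreadable inside the printed URL.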

1.7 Crawler scale

A single page: requests

A whole site: scrapy

The whole web: a search engine

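1.8 The robots protocol

The robots protocol (also known as the crawler protocol or robot protocol) is formally called the Robots Exclusion Protocol. A website uses it to tell search engines which pages may be crawled and which may not.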

https://www.jd.com/robots.txt

User-agent: *

Disallow: /?*

Disallow: /pop/*.html

Disallow: /pinpai/*.html?*

User-agent: EtaoSpider

Disallow: /

User-agent: HuihuiSpider

Disallow: /

User-agent: GwdangSpider

Disallow: /

User-agent: WochachaSpider

Disallow: /

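* means all crawlers and / means the root directory, so the following two lines would bar every crawler from the entire site: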

User-agent: *

Disallow: /

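JD treats the four crawlers named above (EtaoSpider, HuihuiSpider, GwdangSpider and WochachaSpider) as malicious and denies them access to the whole site.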
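A crawler can check these rules before fetching anything. The sketch below feeds the excerpt quoted above to the standard library's urllib.robotparser instead of downloading it; "MyCrawler" is a made-up user agent that matches only the * group. As far as I know, this parser matches paths by plain prefix and does not expand the * wildcards inside the Disallow paths, so treat its answers for those three rules as approximate.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*

User-agent: EtaoSpider
Disallow: /

User-agent: HuihuiSpider
Disallow: /

User-agent: GwdangSpider
Disallow: /

User-agent: WochachaSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://www.jd.com/index.html"
for agent in ("EtaoSpider", "WochachaSpider", "MyCrawler"):
    # can_fetch() answers: may this user agent request this URL?
    print(agent, parser.can_fetch(agent, page))
```

In a real crawler, parser.set_url("https://www.jd.com/robots.txt") followed by parser.read() would load the live file instead of the hard-coded excerpt.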

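1.9 Viewing the User-Agent in Chrome

Press F12, open the Network tab, and click a request in the Name column; the User-Agent appears among its request headers.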

2. requests examples

import requests

import os

def amazon():
    # url = "https://www.amazon.cn"
    # r = requests.get(url)
    # print(r.status_code)
    # url = "https://www.amazon.com"
    # In principle Python can fetch the page directly, but requests honestly tells the site that the request was sent by Python,
    # and the site uses the headers to decide that the visitor is a crawler rather than a browser. amazon answers 503; with a browser User-Agent it works.
    # What I actually got was error 10060 (connection timed out) straight away.
    # url = "https://www.amazon.co.jp"
    # try:
    #     r = requests.get(url)
    #     # r = requests.get(url, timeout=5)
    #     print(r.request.headers)  # request headers
    #     # print(r.request.url)
    #     # r.raise_for_status()
    #     print(r.status_code)
    # except:
    #     print("except %s" % r.status_code)
    # print(r.request.headers)  # note: r.request. The site sees from the headers that Python sent the request, treats it as a crawler, and refuses
    # hd = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    # r = requests.request('post', url=url, headers=hd)
    # r = requests.get(url, headers=hd)
    # print("final %s" % r.status_code)
    # The failures above came from a network problem that made amazon unreachable; I thought the code was at fault and spent a long time changing it... The code below is all that is needed.
    url = "https://www.amazon.com"
    r = requests.get(url)
    print("%s %s" % (r.status_code, r.request.headers))  # note: r.request.headers, not requests
    # 503 {'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
    hd = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    # r = requests.request('post', url=url, headers=hd)  # a POST request returns status 405: the server does not allow POST
    r = requests.get(url, headers=hd)
    print("%s %s" % (r.status_code, r.request.headers))  # 200

def searchengine():
    keyword = "知乎"
    try:
        kv = {'wd': keyword}
        r = requests.get("http://www.baidu.com/s", params=kv)
        print(r.request.url)
        r.raise_for_status()
        print(r.text[:1000])  # the result is long, so print only the first 1000 characters
    except requests.RequestException:
        print("crawl failed")
    # Searching Baidu directly for Wuhan University (武汉大学) or HUST (华中科技大学) gives
    # https://www.baidu.com/s?wd=武汉大学&rsv_spt=1……
    # https://www.baidu.com/s?wd=华中科技大学&rsv_spt=1……
    # so replacing wd is all it takes to run a search

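def images():
    # a loop (or a regular expression that collects URLs) could be used to download many images in bulk
    url = "https://meowdancing.com/images/timg.jpg"
    root = "F://Pictures//"
    path = root + url.split('/')[-1]  # split on "/" and take the last piece, i.e. timg.jpg
    try:
        if not os.path.exists(root):
            os.mkdir(root)  # create the target directory
        if not os.path.exists(path):
            r = requests.get(url)
            r.raise_for_status()  # do not save an error page as if it were the image
            with open(path, 'wb') as f:
                f.write(r.content)  # the with block closes the file on exit
            print("file saved")
        else:  # mind the indentation: this else belongs to the second if
            print("file already exists")
    except (requests.RequestException, OSError):
        print("crawl failed")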

def ipaddress():
    url = "http://www.ip138.com/ips138.asp?ip="
    ip = "101.24.190.228"
    url = url + ip
    # appending "&action=2" is optional
    hd = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
    print(url)
    try:
        r = requests.get(url, headers=hd)   # seems not to work without the User-Agent header
        print(r.status_code)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[-2000:])  # print the last 2000 characters
    except requests.RequestException:
        print("crawl failed")
    # Open http://www.ip138.com/ and enter an IP address to look up its location; after submitting, look at the URL in the browser:
    # http://www.ip138.com/ips138.asp?ip=202.114.66.96&action=2
    # So the query URL is
    # http://www.ip138.com/ips138.asp?ip="your IP address"
    #
    # This example shows that many interactive operations are really carried out by submitting an HTTP URL,
    # so once a little analysis has revealed how the URL corresponds to the interaction, Python can fetch the resources we need.

if __name__ == "__main__":

    #amazon()

    #searchengine()

    #images()

    ipaddress()
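The images() function above derives the file name with url.split('/')[-1], which would keep a query string if the URL had one. A slightly more careful variant using only the standard library might look like this; filename_from_url(), the "?size=large" query and the "download.bin" fallback are all made up for this example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Return the last path segment of url, ignoring query string and fragment."""
    name = PurePosixPath(urlsplit(url).path).name
    # A URL that ends in "/" has no file name, so fall back to the default.
    return name or default


print(filename_from_url("https://meowdancing.com/images/timg.jpg"))
print(filename_from_url("https://meowdancing.com/images/timg.jpg?size=large"))
print(filename_from_url("https://meowdancing.com/"))
```

The result can then be joined to the download directory with os.path.join(root, name) instead of string concatenation.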
