Python web scraping with requests + selenium + BeautifulSoup

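Preface:

Environment: 64-bit Windows, Python 3.4

Basic requests usage:

1. Install: pip install requests

2. Purpose: requests sends network requests; like a browser, it can issue all kinds of HTTP requests to retrieve data from a website.

3. Common operations: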

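import requests  # import the requests module

r = requests.get("https://api.github.com/events")  # fetch a page

# timeout: stop waiting for a response after the given number of seconds.
# A value as small as 0.001 will almost always raise requests.exceptions.Timeout.
try:
    r2 = requests.get("https://api.github.com/events", timeout=0.001)
except requests.exceptions.Timeout as exc:
    print("request timed out:", exc)

# pass query-string parameters as a dict
payload = {'key1': 'value1', 'key2': 'value2'}
r1 = requests.get("http://httpbin.org/get", params=payload)
print(r1.url)  # the final URL, with the parameters encoded into the query string

print(r.text)  # response body decoded as text
print(r.encoding)  # encoding currently used to decode r.text
print(r.content)  # response body as raw bytes
print(r.status_code)  # HTTP status code of the response
print(r.status_code == requests.codes.ok)  # compare against the built-in status code lookup object
print(r.headers)  # response headers, shown as a Python dict
print(r.headers['content-type'])  # header names are case-insensitive, so any capitalization works
print(r.history)  # list of Response objects from any redirects that were followed
print(type(r))  # type of the response object (requests.models.Response)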
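
The params and timeout options above can be explored a little further. The sketch below is not from the original post: build_url and fetch are hypothetical helper names, and the 5-second default timeout is an arbitrary assumption. It uses requests.Request(...).prepare() to show how a params dict is encoded into the URL without sending anything, and catches requests.exceptions.RequestException so that a timeout or a bad URL does not crash the script.

```python
import requests


def build_url(url, params):
    """Return the final URL requests would use, without sending a request."""
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


def fetch(url, timeout=5):
    """Return the response body as text, or None if the request fails.

    The name `fetch` and the 5-second default are illustrative choices.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as exc:
        # Timeout, ConnectionError, HTTPError and invalid-URL errors
        # all derive from RequestException
        print("request failed:", exc)
        return None
    return r.text


# Inspect the encoded URL offline
print(build_url("http://httpbin.org/get", {"key1": "value1", "key2": "value2"}))
```

Calling fetch("https://api.github.com/events") would then return the page text, or None with a printed reason if the request failed.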

Basic BeautifulSoup4 usage:

1. Install: pip install beautifulsoup4

2. Purpose: Beautiful Soup is a Python library for extracting data from HTML or XML files.

3. Common operations:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

ss = BeautifulSoup(html_doc, "html.parser")

print(ss.prettify())  # print the document with standard indentation
print(ss.title)  # <title>The Dormouse's story</title>
print(ss.title.name)  # title
print(ss.title.string)  # The Dormouse's story
print(ss.title.parent.name)  # head
print(ss.p)  # <p class="title"><b>The Dormouse's story</b></p>
print(ss.p['class'])  # ['title']
print(ss.a)  # the first <a> tag (the Elsie link)
print(ss.find_all("a"))  # list of all three <a> tags
print(ss.find(id="link3"))  # the <a> tag whose id is "link3" (the Tillie link)

for link in ss.find_all("a"):
    print(link.get("href"))  # URL of every <a> tag in the document

print(ss.get_text())  # all the text content of the document
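
As a small follow-up to the find_all loop above, here is a sketch that collects each link's text and URL into a list, and then does a similar lookup with a CSS selector through select(). The helper name extract_links and the shortened HTML sample are assumptions made for illustration, not part of the original post.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only tags that actually carry an href attribute
    return [(a.get_text(), a["href"]) for a in soup.find_all("a", href=True)]


links = extract_links(html_doc)
for text, href in links:
    print(text, "->", href)

# select() takes a CSS selector; "a.sister" matches <a> tags with class "sister"
soup = BeautifulSoup(html_doc, "html.parser")
ids = [a["id"] for a in soup.select("a.sister")]
print(ids)
```

Filtering with href=True avoids a KeyError on anchors that have no href, which link.get("href") would otherwise report as None.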

# Second example: inspect a single tag.
# Reuses html_doc and the BeautifulSoup import from the example above.
soup = BeautifulSoup(html_doc, 'html.parser')  # create the BeautifulSoup object

find = soup.find('p')  # find() returns the first <p> tag
print("find's return type is ", type(find))  # type of the return value (bs4.element.Tag)
print("find's content is", find)  # the tag that find() returned
print("find's Tag Name is ", find.name)  # name of the tag
print("find's Attribute(class) is ", find['class'])  # value of the tag's class attribute
print(find.string)  # text inside the tag

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, "html.parser")
comment = soup1.b.string
print(type(comment))  # <class 'bs4.element.Comment'>: the string is a comment node
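
Because a comment comes back as a special string type, it can be told apart from ordinary text. The sketch below, which is an illustration rather than part of the original post, finds every comment in a document with isinstance and removes the comments before reading the visible text. The extra "hidden note" comment in the sample markup is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = (
    "<p>Visible text<!-- hidden note --></p>"
    "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
)
soup = BeautifulSoup(markup, "html.parser")

# Comment is a subclass of NavigableString, so isinstance can pick it out
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)

# Remove the comment nodes from the tree, then read the remaining text
for c in comments:
    c.extract()
text = soup.get_text()
print(text)
```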

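A first try:

import io
import sys

import requests

# re-wrap standard output so that it encodes text as gb18030 (changes the default output encoding)
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

r = requests.get('https://unsplash.com')  # send a GET request to the target URL; returns a Response object
print(r.text)  # r.text is the page HTML from the HTTP response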
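
The two libraries are usually combined: requests downloads the page and BeautifulSoup parses it. The following is a minimal sketch under stated assumptions: parse_page and fetch_and_parse are hypothetical helper names, the 10-second timeout is arbitrary, and the sample HTML is invented so the parsing step can be run without network access.

```python
import requests
from bs4 import BeautifulSoup


def parse_page(html):
    """Return the page title (or None) and a list of all link targets."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else None
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]
    return title, hrefs


def fetch_and_parse(url, timeout=10):
    """Download `url` and hand its HTML to parse_page (needs network access)."""
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    return parse_page(r.text)


# Offline demonstration of the parsing step
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><a href='/a'>A</a><a>no href</a></body></html>"
)
title, hrefs = parse_page(sample)
print(title, hrefs)
```

With network access, fetch_and_parse('https://unsplash.com') would return the title and links found in the raw HTML. Pages that build their content with JavaScript may expose fewer links this way than a browser shows.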

