Spider_Basics Summary 3_BeautifulSoup objects + find() + find_all()
# In this section:
# Parsing complex HTML pages:
# 1--bs.find() bs.find_all() tag.get_text()
# find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
# find(tag/tag_list, attributes_dict, recursive, text, keywords)
# 2--Navigating the tree (lookup by position, not CSS selectors): usually combined with bs.find() and bs.find_all()
# tag.children tag.descendants tag.next_siblings tag.previous_siblings tag.parent
# 3--BeautifulSoup objects:
# BeautifulSoup object bs
# Tag object (a single Tag; find_all() returns a list of Tags)
# NavigableString object: the text inside a tag, not the tag itself
# Comment object: an HTML comment in the document, <!--like this-->
# When parsing a complex HTML page, BeautifulSoup can use CSS styling attributes (such as class and id) to tell different tags apart easily:
# bs.find() bs.find_all() tag.get_text()
# I. Introduction:
import requests
from requests import exceptions
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.text, 'html.parser')
# print(bs)
nameList = bs.find_all('span', {'class': 'green'})  # bs.find_all(tag/tag_list, attributes_dict) returns a list of the matching tags
for name in nameList:
    print(name.get_text())  # call tag.get_text() last; until then it is usually better to keep the HTML tag structure
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna
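# A minimal sketch of why get_text() is best called last: it discards the markup, so nothing
# tag-based can be searched afterwards. The HTML string and the names sample_html, sample_bs
# and green_span are made up for illustration and are not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for a fragment of a real page.
sample_html = '<p>Said <span class="green">Anna <b>Pavlovna</b></span> to the prince.</p>'
sample_bs = BeautifulSoup(sample_html, 'html.parser')

green_span = sample_bs.find('span', {'class': 'green'})

span_markup = str(green_span)      # the Tag still carries its markup and nested tags
span_text = green_span.get_text()  # only the text survives; <span> and <b> are gone

print(span_markup)  # <span class="green">Anna <b>Pavlovna</b></span>
print(span_text)    # Anna Pavlovna
```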
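# II. Finding tags by name and attributes:
# bs.find_all() and bs.find() (the latter behaves like the former with limit=1, except that it returns a single result or None instead of a list)
# find_all(tag/tag_list, attributes_dict, recursive, text, limit, keywords)
# find(tag/tag_list, attributes_dict, recursive, text, keywords)
# tag/tag_list (a tag name or a list of tag names) -- e.g. 'span' or ['h1', 'h2', 'p']
# attributes_dict (attribute dictionary) -- e.g. {'class': 'green'}, or {'class': {'green', 'red'}} to match either value
# recursive -- defaults to True: search all descendants of the tag, at any depth. With recursive=False only the direct children are searched.
# text -- text='the exact text to look for' matches on text content instead of tag attributes and returns NavigableString objects, not Tag objects. (Beautiful Soup 4.4.0 and later call this argument string; text is the older name.)
# limit -- stop after this many matches. Matches are taken in the order they appear on the page, so the first few are not necessarily the ones you want.
# keywords -- one or more keyword arguments that further restrict the matching tags, e.g. id='title' or class_='green'. (class is a reserved word in Python, so bs uses class_ with a trailing underscore.)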
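# The parameters listed above can be tried offline on a small document. This is only a sketch:
# the HTML string and every variable name in it are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two <p> directly under the div, one nested inside <section>.
sample_html = """
<div id="main">
  <h1 id="title">Heading</h1>
  <p class="green">one</p>
  <section><p class="red">two</p></section>
  <p class="green">three</p>
</div>
"""
sample_bs = BeautifulSoup(sample_html, 'html.parser')
main_div = sample_bs.find('div', {'id': 'main'})

all_p = main_div.find_all('p')                      # recursive=True (default): every descendant <p>
direct_p = main_div.find_all('p', recursive=False)  # only <p> tags that are direct children
first_two = main_div.find_all('p', limit=2)         # first two matches in document order
headings_and_p = main_div.find_all(['h1', 'p'])     # a list of tag names matches any of them
green_or_red = main_div.find_all('p', {'class': ['green', 'red']})  # either class value
by_id = sample_bs.find_all(id='title')              # keyword argument on an attribute
by_class = sample_bs.find_all(class_='green')       # class_ because class is reserved

print([p.get_text() for p in all_p])      # ['one', 'two', 'three']
print([p.get_text() for p in direct_p])   # ['one', 'three']
print([p.get_text() for p in first_two])  # ['one', 'two']
```

# Note how limit=2 returns 'one' and 'two', not the two direct children: matches are counted in page order.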
# Example 1:
titles = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([title for title in titles])  # [<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
prince = bs.find(text='the prince')
print(type(prince))  # <class 'bs4.element.NavigableString'>
prince_list = bs.find_all(text='the prince')
print(prince_list)
print([prince for prince in prince_list])
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]
<class 'bs4.element.NavigableString'>
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
['the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince', 'the prince']
# Example 2:
allText = bs.find_all(id='title', class_='text')  # both keyword filters must match the same tag
print(allText)
print([text for text in allText])
[]
[]
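# The empty lists above are what you get when no single tag satisfies every keyword filter,
# because multiple keyword filters are combined with AND. The sketch below shows this on an
# invented HTML string (sample_html and the other names are assumptions for illustration),
# along with one possible way to express OR using a function filter.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: no tag has both id="title" and class="text".
sample_html = (
    '<h1 id="title">A</h1>'
    '<div class="text">B</div>'
    '<div id="body" class="text">C</div>'
)
sample_bs = BeautifulSoup(sample_html, 'html.parser')

both_filters = sample_bs.find_all(id='title', class_='text')  # AND: nothing matches
body_text = sample_bs.find_all(id='body', class_='text')      # AND: one tag has both

# OR has to be spelled out, for example with a function that receives each Tag.
either_filter = sample_bs.find_all(
    lambda tag: tag.get('id') == 'title' or 'text' in tag.get('class', [])
)

print(both_filters)                            # []
print([t.get_text() for t in body_text])       # ['C']
print([t.get_text() for t in either_filter])   # ['A', 'B', 'C']
```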
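# III. BeautifulSoup objects:
# 1-BeautifulSoup object bs: the whole parsed document
# 2-Tag object (a single Tag; find_all() returns a list of Tags)
# 3-NavigableString object: the text inside a tag, not the tag itself
# 4-Comment object: an HTML comment in the document, <!--like this-->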
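# The four object types can be seen side by side on a one-line document. A minimal sketch;
# the HTML string and the variable names are invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup: a paragraph holding plain text followed by an HTML comment.
sample_html = '<p>Hello<!-- a hidden note --></p>'
sample_bs = BeautifulSoup(sample_html, 'html.parser')

p_tag = sample_bs.p               # Tag
text_node = p_tag.contents[0]     # NavigableString: the text "Hello"
comment_node = p_tag.contents[1]  # Comment: the text between <!-- and -->

print(type(sample_bs).__name__)     # BeautifulSoup
print(type(p_tag).__name__)         # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment

# Comment is a subclass of NavigableString, so check for Comment first when telling them apart.
all_comments = sample_bs.find_all(string=lambda s: isinstance(s, Comment))
print(all_comments)
```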
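# IV. Navigating the tree: children, descendants, siblings, and parents
# find_all() and find() look tags up by name and attributes; we can also look tags up by their position in the document:
# 1) A single direction: bs.tag.subtag.anothersubtag
# 2) Navigating the tree: vertically and horizontally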
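# Single-direction access (bs.tag.subtag.anothersubtag) always follows the first tag of each
# name. A short sketch on invented markup; sample_html and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one table of two rows.
sample_html = (
    '<html><body><table id="giftList">'
    '<tr><th>Item Title</th></tr>'
    '<tr><td>Vegetable Basket</td></tr>'
    '</table></body></html>'
)
sample_bs = BeautifulSoup(sample_html, 'html.parser')

first_row = sample_bs.body.table.tr  # first <tr> of the first <table> in <body>
same_row = sample_bs.tr              # shorter, but relies on it being the first <tr> anywhere
first_header = sample_bs.table.tr.th.get_text()
missing = sample_bs.body.ul          # a tag that is not there gives None, not an error

print(first_header)  # Item Title
print(missing)       # None
```

# Because a missing tag yields None, a longer chain such as bs.body.ul.li would raise
# AttributeError at the first absent step, which is one reason find() with attributes is sturdier.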
# 1-- Children: .children
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for child in bs.find('table', {'id': 'giftList'}).children:
    print(child)
    print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
--------------------------------------------
# 2-- Descendants: .descendants
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for child in bs.find('table', {'id': 'giftList'}).descendants:  # to reach just the first row, bs.table.tr or bs.tr also works, but it is less specific and breaks easily if the page changes
    print(child)
    print('--------------------------------------------')
--------------------------------------------
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
--------------------------------------------
<th>
Item Title
</th>
--------------------------------------------
Item Title
--------------------------------------------
<th>
Description
</th>
--------------------------------------------
Description
--------------------------------------------
<th>
Cost
</th>
--------------------------------------------
Cost
--------------------------------------------
<th>
Image
</th>
--------------------------------------------
Image
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
--------------------------------------------
<td>
Vegetable Basket
</td>
--------------------------------------------
Vegetable Basket
--------------------------------------------
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
--------------------------------------------
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
--------------------------------------------
<span class="excitingNote">Now with super-colorful bell peppers!</span>
--------------------------------------------
Now with super-colorful bell peppers!
--------------------------------------------
--------------------------------------------
<td>
$15.00
</td>
--------------------------------------------
$15.00
--------------------------------------------
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img1.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
--------------------------------------------
<td>
Russian Nesting Dolls
</td>
--------------------------------------------
Russian Nesting Dolls
--------------------------------------------
<td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td>
--------------------------------------------
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
--------------------------------------------
<span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
--------------------------------------------
8 entire dolls per set! Octuple the presents!
--------------------------------------------
--------------------------------------------
<td>
$10,000.52
</td>
--------------------------------------------
$10,000.52
--------------------------------------------
<td>
<img src="../img/gifts/img2.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img2.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
--------------------------------------------
<td>
Fish Painting
</td>
--------------------------------------------
Fish Painting
--------------------------------------------
<td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td>
--------------------------------------------
If something seems fishy about this painting, it's because it's a fish!
--------------------------------------------
<span class="excitingNote">Also hand-painted by trained monkeys!</span>
--------------------------------------------
Also hand-painted by trained monkeys!
--------------------------------------------
--------------------------------------------
<td>
$10,005.00
</td>
--------------------------------------------
$10,005.00
--------------------------------------------
<td>
<img src="../img/gifts/img3.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img3.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
--------------------------------------------
<td>
Dead Parrot
</td>
--------------------------------------------
Dead Parrot
--------------------------------------------
<td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td>
--------------------------------------------
This is an ex-parrot!
--------------------------------------------
<span class="excitingNote">Or maybe he's only resting?</span>
--------------------------------------------
Or maybe he's only resting?
--------------------------------------------
--------------------------------------------
<td>
$0.50
</td>
--------------------------------------------
$0.50
--------------------------------------------
<td>
<img src="../img/gifts/img4.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img4.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
--------------------------------------------
<td>
Mystery Box
</td>
--------------------------------------------
Mystery Box
--------------------------------------------
<td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td>
--------------------------------------------
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
--------------------------------------------
<span class="excitingNote">Keep your friends guessing!</span>
--------------------------------------------
Keep your friends guessing!
--------------------------------------------
--------------------------------------------
<td>
$1.50
</td>
--------------------------------------------
$1.50
--------------------------------------------
<td>
<img src="../img/gifts/img6.jpg"/>
</td>
--------------------------------------------
--------------------------------------------
<img src="../img/gifts/img6.jpg"/>
--------------------------------------------
--------------------------------------------
--------------------------------------------
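# The difference between the two outputs above is easier to see on a tiny table: .children
# stops one level down, while .descendants walks the whole subtree depth-first and also yields
# the text nodes. A sketch on invented markup; all names here are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table written without whitespace between tags.
sample_html = (
    '<table id="giftList">'
    '<tr><th>Cost</th></tr>'
    '<tr><td>$15.00 <span>note</span></td></tr>'
    '</table>'
)
sample_bs = BeautifulSoup(sample_html, 'html.parser')
table = sample_bs.find('table', {'id': 'giftList'})

child_tags = [c.name for c in table.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]
descendant_strings = [str(d) for d in table.descendants if not isinstance(d, Tag)]

print(child_tags)          # ['tr', 'tr']
print(descendant_tags)     # ['tr', 'th', 'tr', 'td', 'span']
print(descendant_strings)  # ['Cost', '$15.00 ', 'note']
```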
# 3-- Siblings: .next_siblings and .previous_siblings
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:  # the header row itself is not among its own siblings
    print(sibling)
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
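# Sibling iteration also yields the whitespace between tags as NavigableString objects, which
# is worth filtering out when only rows are wanted. A sketch on invented markup; sample_html
# and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with a newline between rows, as in most hand-written HTML.
sample_html = """<table id="giftList">
<tr id="header"><th>Item Title</th></tr>
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>"""
sample_bs = BeautifulSoup(sample_html, 'html.parser')
header_row = sample_bs.find('table', {'id': 'giftList'}).tr

raw_siblings = list(header_row.next_siblings)  # tags and the newline strings between them
later_rows = [s['id'] for s in header_row.next_siblings if isinstance(s, Tag)]
same_rows = [s['id'] for s in header_row.find_next_siblings('tr')]  # tags only

last_row = sample_bs.find('tr', {'id': 'gift2'})
earlier_rows = [s['id'] for s in last_row.previous_siblings if isinstance(s, Tag)]

print(len(raw_siblings))  # 5
print(later_rows)         # ['gift1', 'gift2']
print(earlier_rows)       # ['gift1', 'header']
```

# .previous_siblings walks backwards, so the nearest row comes first.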
# 4-- Parent: .parent (used less often)
# Find the price of the item whose image is '../img/gifts/img1.jpg':
import requests
from bs4 import BeautifulSoup
html = requests.get('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html.text, 'html.parser')
print(bs.find('img', {'src': '../img/gifts/img1.jpg'})
      .parent.previous_sibling.get_text())  # go up to the parent <td>, then to its previous sibling
$15.00
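# The chained lookup above can be unpacked step by step. A sketch on invented markup that
# mimics one row of the table; sample_html and the variable names are assumptions. It only
# works this directly because the cells are adjacent (</td><td>) with no whitespace between
# them, as in the printed rows earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table: title, price, image.
sample_html = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
sample_bs = BeautifulSoup(sample_html, 'html.parser')

img = sample_bs.find('img', {'src': '../img/gifts/img1.jpg'})
image_cell = img.parent                   # the <td> that wraps the image
price_cell = image_cell.previous_sibling  # the <td> immediately before it
price = price_cell.get_text()

row = img.find_parent('tr')               # climb to the nearest enclosing <tr> by name

print(price)     # $15.00
print(row.name)  # tr
```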