9.3.4 BeaufitulSoup4
BeautifulSoup 是一个非常优秀的Python扩展库,可以用来从HTML或XML文件中提取我们感兴趣的数据,并且允许指定使用不同的解析器。
使用 pip install BeaufifulSoup4 直接进行模块的安装。安装之后应使用 from bs4 import BeautifulSoup 导入并使用。
下面简单演示下BeautifulSoup4的功能,更加详细完整的学习资料请参考 https://www.crummy.com/software/BeautifulSoup/bs4/doc/。
>>> from bs4 import BeautifulSoup
>>>
>>> #自动添加和补全标签
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>>
>>> #自定义一个html文档内容
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <p class="story">...</p>
"""
>>>
>>> #解析这段html文档内容,以优雅的方式展示出来
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>> #访问特定标签
>>> soup.title
<title>The Dormouse's story</title>
>>>
>>> #标签名字
>>> soup.title.name
'title'
>>>
>>> #标签文本
>>> soup.title.text
"The Dormouse's story"
>>>
>>> #title标签的上一级标签
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>>
>>> soup.b
<b>The Dormouse's story</b>
>>>
>>> soup.b.name
'b'
>>> soup.b.text
"The Dormouse's story"
>>>
>>> #把整个BeautifulSoup对象看作标签对象
>>> soup.name
'[document]'
>>>
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> #标签属性
>>> soup.p['class']
['title']
>>>
>>> soup.p.get('class') #也可以这样查看标签属性
['title']
>>>
>>> soup.p.text
"The Dormouse's story"
>>>
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>>
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>>
>>> #查看a标签所有属性
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}
>>>
>>> #查找所有a标签
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> #同时查找<a>和<b>标签
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> import re
>>> #查找href包含特定关键字的标签
>>> soup.find_all(href=re.compile("elsie"))
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>>
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>>
>>> for link in soup.find_all('a'):
print(link.text,':',link.get('href')) Elsie : http://example.com/elsie
Lacie : http://example.com/lacie
Tillie : http://example.com/tillie
>>>
>>> print(soup.get_text()) #返回所有文本 The Dormouse's story The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
... >>>
>>> #修改标签属性
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>
>>>
>>> #修改标签文本
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>>
>>> soup.a.string
'test_Elsie'
>>>
>>> print(soup.prettify())
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">
test_Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>>
>>>
>>> #遍历子标签
>>> for child in soup.body.children:
print(child) <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p> <p class="story">...</p> >>>
9.3.4 BeaufitulSoup4的更多相关文章
随机推荐
- Android 布局属性大全
常用属性汇总: android:hint 设置EditText为空时输入的提示信息 android:gravity 对该view内容的限定:靠上 下 左 右 andro ...
- Bing Maps进阶系列五:通过DeepEarth的MiniMap控件为Bing Maps扩展迷你小地图
Bing Maps进阶系列五:通过DeepEarth的MiniMap控件为Bing Maps扩展迷你小地图 Bing Maps Silverlight Control虽然为我们提供了简洁.方便的开发模 ...
- MFC基础学习
RECT rect = { }; //获取窗口的内部客户区矩形 GetClientRect(&rect); 模态和费模态对话框! 模态对话框只需要包含对话框头文件,定义对话框类,调用DoMod ...
- go语言笔记——是c开发的 lex yacc进行词法和语法分析,go不支持函数和运算符重载,不支持类型继承,也不支持断言,还有泛型
从 Go 1.0.3 版本开始,不再使用 8g,8l 之类的指令进行程序的构建,取而代之的是统一的 go build 和 go install 等命令,而这些指令会自动调用相关的编译器或链接器. 如果 ...
- Datatable转换为Json 的方法
/// <summary> /// Datatable转换为Json /// </summary> /// <param n ...
- STM32:SWD下载方式
最近没事干做了个STM32小板子,芯片是STM32VBT6,下载方式用的SWD,比JTAG节省空间 我用了五根线,3.3V,GND,RESET,SWDIO,SWCLK, JTAG 接口pin map: ...
- PCB javascript解析钻孔(Excellon)格式实现方法
解析钻孔(Excellon)格式前首先得了解此格式,这样才能更好的解析呀. 一个钻孔里面包含的基本信息如下: 1.单位:公式mm,英制inch 2.省零方式:前省零,后省零 3.坐标方式:绝对坐标,相 ...
- tp 3.2 组合查询, 字符串模式查询
$User = M("User"); // 实例化User对象 $map['id'] = array('neq',1); $map['name'] = 'ok'; $map['_s ...
- Linux 命令多到记不住?这个开源项目帮你一网打尽!
本文首发于我的公众号 Linux云计算网络(id: cloud_dev),专注于干货分享,号内有 10T 书籍和视频资源,后台回复「1024」即可领取,欢迎大家关注,二维码文末可以扫. 最近发现了一个 ...
- 个人作业 - Alpha 项目测试
写在前面 课程链接:https://edu.cnblogs.com/campus/xnsy/SoftwareEngineeringClass2 作业要求:https://edu.cnblogs.com ...