BeautifulSoup is an excellent Python extension library for extracting the data we care about from HTML or XML files, and it lets you specify which parser to use.

Install the module directly with pip install beautifulsoup4. After installation, import it with from bs4 import BeautifulSoup.
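The parser is selected by the second argument to BeautifulSoup(). As a minimal sketch, assuming only the built-in 'html.parser' is available (so nothing beyond beautifulsoup4 needs to be installed), the snippet below parses a deliberately broken fragment. The fragment and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A fragment with two tags left open on purpose.
fragment = "<p>hello <b>world"

# 'html.parser' ships with Python, so no extra package is needed.
soup = BeautifulSoup(fragment, "html.parser")

# The open tags are closed, but html.parser does not wrap the
# fragment in <html> or <body> the way some other parsers do.
repaired = str(soup)
print(repaired)
print(soup.html)
```

Other parsers may repair the same markup differently, so it is worth checking the parsed tree when switching from one parser to another.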


>>> from bs4 import BeautifulSoup
>>> # Missing tags are added and completed automatically (the 'lxml' parser requires the lxml package)
>>> BeautifulSoup('hello world','lxml')
<html><body><p>hello world</p></body></html>
>>> # Define some custom HTML document content
>>> html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="" class="sister" id="link1">Elsie</a>,
<a href="" class="sister" id="link2">Lacie</a>and
<a href="" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
>>> # Parse the HTML document and display it in a neatly indented form
>>> soup = BeautifulSoup(html_doc,'html.parser')
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>> # Access a specific tag
>>> soup.title
<title>The Dormouse's story</title>
>>> # Tag name
>>> soup.title.name
'title'
>>> # Tag text
>>> soup.title.text
"The Dormouse's story"
>>> # Parent tag of the title tag
>>> soup.title.parent
<head><title>The Dormouse's story</title></head>
>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.b
<b>The Dormouse's story</b>
>>> soup.b.text
"The Dormouse's story"
>>> # The whole BeautifulSoup object can be treated as a tag object
>>> soup.body
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="" id="link1">Elsie</a>,
<a class="sister" href="" id="link2">Lacie</a>and
<a class="sister" href="" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> # Tag attributes
>>> soup.p['class']
['title']
>>> soup.p.get('class') # attributes can also be read this way
['title']
>>> soup.p.text
"The Dormouse's story"
>>> soup.p.contents
[<b>The Dormouse's story</b>]
>>> soup.a
<a class="sister" href="" id="link1">Elsie</a>
>>> # View all attributes of the a tag
>>> soup.a.attrs
{'class': ['sister'], 'id': 'link1', 'href': ''}
>>> # Find all a tags
>>> soup.find_all('a')
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
>>> # Find <a> and <b> tags at the same time
>>> soup.find_all(['a','b'])
[<b>The Dormouse's story</b>, <a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
>>> import re
>>> # Find tags whose href contains a given keyword (every href in this sample is empty, so nothing matches)
>>> soup.find_all(href=re.compile("elsie"))
[]
>>> soup.find(id='link3')
<a class="sister" href="" id="link3">Tillie</a>
>>> soup.find_all('a',id='link3')
[<a class="sister" href="" id="link3">Tillie</a>]
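The difference between find() and find_all() matters when nothing matches: find() returns a single tag or None, while find_all() always returns a list. The sketch below assumes href values that actually contain the keyword being searched for. The paths /sisters/elsie and /sisters/lacie and the id link9 are invented for illustration and do not come from the sample document.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the href paths are made up for this example.
html = (
    '<p><a href="/sisters/elsie" id="link1">Elsie</a> '
    '<a href="/sisters/lacie" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# find_all() with a compiled regex keeps tags whose href matches it.
matches = soup.find_all(href=re.compile("elsie"))
matched_ids = [a["id"] for a in matches]
print(matched_ids)

# find() returns None when no tag matches, so check before using it.
missing = soup.find(id="link9")
if missing is None:
    print("no tag with id=link9")
```

Checking for None before touching the result of find() avoids an AttributeError on pages where the expected tag is absent.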
>>> for link in soup.find_all('a'):
    print(link.text,':',link.get('href'))

Elsie :
Lacie :
Tillie :
>>> print(soup.get_text()) # return all the text

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters;and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...

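get_text() returns every text node in the document, including the newlines that sit between tags. As a minimal sketch, assuming tidier output is wanted, the separator and strip arguments of get_text() can be used to trim each piece and join the pieces with a single space. The short document below is a cut-down stand-in, not the full sample.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the sample document (illustrative only).
html = """<p class="story">Once upon a time
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: text nodes are concatenated as they are.
raw_text = soup.get_text()

# strip=True trims each text node and drops the empty ones;
# the first argument is the separator placed between the pieces.
clean_text = soup.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```

Because each text node is stripped separately, punctuation that lives in its own node (the comma here) ends up surrounded by the separator.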
>>> # Modify a tag attribute
>>> soup.a['id']='test_link1'
>>> soup.a
<a class="sister" href="" id="test_link1">Elsie</a>
>>> # Modify the tag text
>>> soup.a.string.replace_with('test_Elsie')
'Elsie'
>>> soup.a.string
'test_Elsie'
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="" id="test_link1">
    test_Elsie
   </a>
   ,
   <a class="sister" href="" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
>>> # Iterate over the child nodes
>>> for child in soup.body.children:
    print(child)



<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="" id="test_link1">test_Elsie</a>,
<a class="sister" href="" id="link2">Lacie</a>and
<a class="sister" href="" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>


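The children of a tag include the newline strings between elements, not only the nested tags. A minimal sketch, assuming only the element children are of interest: filter the iterator with isinstance() against the Tag class from bs4.element. The shortened document and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened version of the sample body (illustrative only).
html_doc = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")

# .children yields tags and the whitespace strings between them.
all_children = list(soup.body.children)

# Keep only real tags, skipping the newline text nodes.
tag_children = [child for child in all_children if isinstance(child, Tag)]

names_and_classes = [(child.name, child.get("class")) for child in tag_children]
print(len(all_children), len(tag_children))
print(names_and_classes)
```

Filtering by type keeps the loop body free of checks for whitespace-only strings.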

