Python爬虫系列-BeautifulSoup详解
安装
pip3 install beautifulsoup4
解析库
解析器 | 使用方法 | 优势 | 劣势 |
---|---|---|---|
Python标准库 | BeautifulSoup(markup,'html,parser') | Python的内置标准库、执行速度适中、文档容错能力强 | Python 2.7.3 or 3.2.2前的版本中文容错能力差 |
lxml HTML 解析库 | BeautifulSoup(markup,'lxml') | 速度快、文档容错能力强 | 需要安装C语言库 |
lxml XML 解析库 | BeautifulSoup(markup,'xml') | 速度快、唯一支持XML的解析器 | 需要安装C语言库 |
html5lib | BeautifulSoup(markup,'xml') | 最好的容错性、以浏览器的方式解析文档、生成HTML5格式的文档 | 速度慢、不依赖外部扩展 |
基本使用
html = """
<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify()
自动补全代码:
<html dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dormouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well
</p>
<p class="story">
...story go on...
</p>
</body>
</html>
print(soup.title.string)
输出html的标题:
The Dormouse's story
标签选择器
选择元素
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
输出结果如下:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> #只返回第一个p标签
获取外层标签的名称
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
title
获取内容的属性
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
两种获取属性名称的方法
dormouse
dormouse
获取内容
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)
The Dormouse's story
嵌套选择
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)
The Dormouse's story
字节点和子孙节点
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)
['Once upon a time there were three little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well\n ']
children是一个迭代器:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
print(i,child)
<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1<a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a>
2<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
... '''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
print(i,child)
孙节点也被输出出来:
<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3<span>Elsie </span>
4 Elsie
5<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
父节点和祖先节点
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
显示结果:
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))
显示结果:
[(0, 'Once upon a time there were three little sisters;and their names were\n '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.parents)))
显示所有结果:最后为源代码跟节点
[(0, <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>)]
兄弟节点
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
显示如下:```html
[(0, Lacie), (1, 'and'), (2, Tillie), (3, '; and they lived at the bottom of a well\n ')]
`print(list(enumerate(soup.a.previous_siblings)))`
> `[(0, 'Once upon a time there were three little sisters;and their names were\n ')]`
## 标准选择器
### find_all(name,attrs,recursive,text,**kwargs)
可根据标签名、属性、内容查找文档
#### name
```py
html = """
<div class="panel">
<div class="panel-heading">
<h4>Helllo</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
显示结果如下:
[
- Foo
- Bar
- Jay
,
- Foo
- Bar
]
```
>
```py
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
print(ul.find_all('li'))
```
显示结果如下
```html
[
,
,
]
[
,
]
```
attrs
html = '''
<div class="panel">\n <div class="panel-heading">\n <h4>Helllo</h4>\n </div>\n <div class="panel-body">\n <ul class="list" id="list-1" name=elements>\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n <li class="element">Jay</li>\n </ul>\n <ul class="list list-small" id="list-2">\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n </ul>\n </div>\n</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
显示如下:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
另外知道ID或Class可以用下列方法查找:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
['Foo', 'Foo']
find(name,attrs,recursive,text,**kwargs)
find返回单个元素,find_all返回所有元素
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
- Foo
- Bar
- Jay
```
print(type(soup.find('ul')))
<class 'bs4.element.Tag'>
print(type(soup.find('page')))
不存在返回结果:
<class 'NoneType'>
CSS选择器
通过select()直接传入CSS选择器即可完成选择
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])
显示结果如下:
[```html
Helllo
```]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
遍历的用法:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
print(ul.select('li'))
显示结果如下:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
获取属性
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
print(ul['id'])
print(ul.attrs['id'])
显示效果如下:
list-1
list-1
list-2
list-2
获取内容
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
print(li.get_text())
显示结果:
Foo
Bar
Jay
Foo
Bar
总结:
- 推荐使用lxml解析库,必要时使用html.parser
- 标签选择筛选功能弱但是速度快
- 建议使用find()、find_all()查询匹配单个结果或多个结果
- 如果对CSS选择器书系建议使用select()
- 记住常用的获取属性和文本值的方法
Python爬虫系列-BeautifulSoup详解的更多相关文章
- Python爬虫系列-Selenium详解
自动化测试工具,支持多种浏览器.爬虫中主要用来解决JavaScript渲染的问题. 用法讲解 模拟百度搜索网站过程: from selenium import webdriver from selen ...
- Python爬虫系列-PyQuery详解
强大又灵活的网页解析库.如果你觉得正则写起来太麻烦,如果你觉得BeautifulSoup语法太难记,如果你熟悉jQuery的语法,那么PyQuery就是你的最佳选择. 安装 pip3 install ...
- python爬虫scrapy项目详解(关注、持续更新)
python爬虫scrapy项目(一) 爬取目标:腾讯招聘网站(起始url:https://hr.tencent.com/position.php?keywords=&tid=0&st ...
- 爬虫系列---selenium详解
一 安装 pip install Selenium 二 安装驱动 chrome驱动文件:点击下载chromedriver (yueyu下载) 三 配置chromedrive的路径(仅添加环境变量即可) ...
- 使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解(新手必学)
为大家介绍下Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的详细方法与函数下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最 ...
- 反爬虫:利用ASP.NET MVC的Filter和缓存(入坑出坑) C#中缓存的使用 C#操作redis WPF 控件库——可拖动选项卡的TabControl 【Bootstrap系列】详解Bootstrap-table AutoFac event 和delegate的分别 常见的异步方式async 和 await C# Task用法 c#源码的执行过程
反爬虫:利用ASP.NET MVC的Filter和缓存(入坑出坑) 背景介绍: 为了平衡社区成员的贡献和索取,一起帮引入了帮帮币.当用户积分(帮帮点)达到一定数额之后,就会“掉落”一定数量的“帮帮 ...
- python 3.x 爬虫基础---Urllib详解
python 3.x 爬虫基础 python 3.x 爬虫基础---http headers详解 python 3.x 爬虫基础---Urllib详解 前言 爬虫也了解了一段时间了希望在半个月的时间内 ...
- 第7.19节 Python中的抽象类详解:abstractmethod、abc与真实子类
第7.19节 Python中的抽象类详解:abstractmethod.abc与真实子类 一. 引言 前面相关的章节已经介绍过,Python中定义某种类型是以实现了该类型对应的协议为标准的,而不 ...
- python之OS模块详解
python之OS模块详解 ^_^,步入第二个模块世界----->OS 常见函数列表 os.sep:取代操作系统特定的路径分隔符 os.name:指示你正在使用的工作平台.比如对于Windows ...
随机推荐
- 出现提示ERROR 1289 The 'InnoDB' feature is disabled; you need MySQL built with 'InnoDB' to have IT working
关闭mysql数据库 在mysql的安装目录中找到my.ini文件找到skip-innodb,在前面加上#号保存,重启mysql服务 OK.
- JS高级学习历程-16
[正则表达式] 1()小括号使用 作用:① 提高表达式优先级关系 ② 提取子字符串内容 模式单元,每个小括号都算作一个模式单元内容,按照内容的下标可以给小括号计数. var reg = /([0-9 ...
- [Java]三大特性之封装
封装这个我们可以从字面上来理解,简单来说就是包装的意思,专业点就是信息隐藏. 是指利用抽象数据类型将数据和基于数据的操作封装在一起,使其构成一个不可分割的独立实体,数据被保护在抽象数据类型的内部,尽可 ...
- LCA 离线做法tarjan
tarjan(int u) { int v; for(int i=h[u];i;i=nex[i])//搜索边的 { v=to[i]; tarjan(v); marge(u,v); vis[v]=; } ...
- P4874 回形遍历 —模拟
思路: 写完后信心满满,结果超时. 我很不解,下了个数据结果——,z竟然是大于1e10的,跟题目给的不一样啊 原来如此,正解是一行一行的走的... 注意当到两边一样近时,应优先向下和右!!!!!! 这 ...
- XML文件的一些操作
XML 是被设计用来传输和存储数据的, XML 必须含有且仅有一个 根节点元素(没有根节点会报错) 源码下载 http://pan.baidu.com/s/1ge2lpM7 好了,我们 先看一个 XM ...
- Java 2 个 List 集合数据求并、补集操作
开发过程中,我们可能需要对 2 个 或多个 List 集合中的数据进行处理,比如多个 List 集合数据求 相同元素,多个 List 集合数据得到只属于本身的数据,如图示: 这里写图片描述 这里以 2 ...
- asp.net重启web应用程序域
我把加载到static静态变量中了,是在数据库中存的,这样每次改了一下必须要重启一下web应用程序,每次去iis操作太麻烦了,于是找的了这个重启的办法,一句话代码: System.Web.HttpRu ...
- 采用React+Ant Design组件化开发前端界面(一)
react-start 基础知识 1.使用脚手架创建项目并启动 1.1 安装脚手架: npm install -g create-react-app 1.2 使用脚手架创建项目: create ...
- dubbo服务降级(2)
dubbo降级服务 使用dubbo在进行服务调用时,可能由于各种原因(服务器宕机/网络超时/并发数太高等),调用中就会出现RpcException,调用失败. 服务降级就是指在由于非业务异常导致的服务 ...