Python Crawler Series: BeautifulSoup in Detail
Installation
pip3 install beautifulsoup4
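Parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup,'html.parser') | Built into Python, moderate speed, good document fault tolerance | Poor document fault tolerance in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,'lxml') | Fast, good document fault tolerance | Requires the C library to be installed |
lxml XML parser | BeautifulSoup(markup,'xml') | Fast, the only parser that supports XML | Requires the C library to be installed |
html5lib | BeautifulSoup(markup,'html5lib') | Best fault tolerance, parses documents the way a browser does, produces HTML5-format documents | Slow, requires an external Python dependency |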
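Since the parser is chosen by the second argument, a script can fall back to the built-in parser when lxml is not installed. A minimal sketch, assuming only bs4 itself is guaranteed to be present; `make_soup` and its `preferred` parameter are hypothetical names used for illustration, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Try each parser name in order and return (soup, parser_name)."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing
            continue
    raise RuntimeError("no usable parser found")


# The markup is deliberately left unclosed; both parsers repair it.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)
print(soup.b.string)
```

Different parsers may repair broken markup differently, so it can help to log which one was actually used.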
Basic usage
html = """
<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
prettify() completes the missing closing tags and prints the document with indentation:
<html dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dormouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well
</p>
<p class="story">
...story go on...
</p>
</body>
</html>
print(soup.title.string)
This prints the title of the HTML document:
The Dormouse's story
Tag selectors
Selecting elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned
Getting the tag name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
title
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Both forms return the value of the attribute:
dormouse
dormouse
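Indexing a tag with a missing attribute raises KeyError, so `get()` is often safer when an attribute may be absent. A minimal sketch; the markup here is made up for illustration and is parsed with html.parser only so that it has no extra dependency:

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the document used in the rest of this post
soup = BeautifulSoup('<p class="title big" name="dormouse">story</p>', "html.parser")
p = soup.p

name_value = p["name"]       # same as p.attrs["name"]
missing = p.get("id")        # None instead of a KeyError
class_value = p["class"]     # class is multi-valued, so this is a list

print(name_value)
print(missing)
print(class_value)
```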
Getting the content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)
The Dormouse's story
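`.string` only works when a tag has exactly one child string; for a tag with mixed content it returns None, and `get_text()` is the usual alternative. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup: <p> contains text, a <b> tag, and more text
soup = BeautifulSoup("<p>Once <b>upon</b> a time</p>", "html.parser")

single = soup.b.string        # <b> has one string child
mixed = soup.p.string         # <p> has several children, so this is None
full_text = soup.p.get_text() # concatenates every string under <p>

print(single)
print(mixed)
print(full_text)
```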
Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)
The Dormouse's story
Child and descendant nodes
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)
['Once upon a time there were three little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well\n ']
contents returns a list, while children is an iterator:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)
The descendants (children of children) are output as well:
<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
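Both children and descendants yield text nodes as well as tags. When only the tags matter, they can be filtered with isinstance. A minimal sketch on a shortened, made-up version of the paragraph above:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened illustrative markup
html_doc = '<p>text <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# Direct children that are tags (text nodes are skipped)
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]

# All nested tags, in document order
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```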
Parent and ancestor nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
Output:
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))
Output:
[(0, 'Once upon a time there were three little sisters;and their names were\n '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.parents)))
All ancestors are listed; the last entry is the root node of the document:
[(0, <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>)]
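The last two entries above print the same markup because the final ancestor is the BeautifulSoup object itself, which wraps the html tag. Printing only the names makes the chain easier to read. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Illustrative markup with an explicit html/body wrapper
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk from the <a> tag up to the document root
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```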
Sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
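Output:
[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.previous_siblings)))
Output:
[(0, 'Once upon a time there were three little sisters;and their names were\n ')]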
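Sibling iteration includes text nodes such as 'and'. To jump straight to the next tag, `find_next_sibling()` and `find_next_siblings()` accept the same filters as find(). A minimal sketch on shortened, made-up markup:

```python
from bs4 import BeautifulSoup

# Shortened illustrative markup
html_doc = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

raw_next = first.next_sibling                     # the text node ", "
next_link = first.find_next_sibling("a")          # skips text, returns a Tag
later_ids = [a["id"] for a in first.find_next_siblings("a")]
prev_link = soup.find(id="link3").find_previous_sibling("a")

print(repr(raw_next))
print(next_link["id"])
print(later_ids)
print(prev_link["id"])
```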
## Standard selectors
### find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content
#### name
```py
html = """
<div class="panel">
<div class="panel-heading">
<h4>Helllo</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
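# Output:
# [<ul class="list" id="list-1">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# <li class="element">Jay</li>
# </ul>, <ul class="list list-small" id="list-2">
# <li class="element">Foo</li>
# <li class="element">Bar</li>
# </ul>]
# <class 'bs4.element.Tag'>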
```
```py
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```
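Output:
```html
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
```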
attrs
html = '''
<div class="panel">\n <div class="panel-heading">\n <h4>Helllo</h4>\n </div>\n <div class="panel-body">\n <ul class="list" id="list-1" name=elements>\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n <li class="element">Jay</li>\n </ul>\n <ul class="list list-small" id="list-2">\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n </ul>\n </div>\n</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
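If you know the id or class, you can also pass it as a keyword argument (class is a Python keyword, so it is written class_):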
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
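find_all() also takes a limit, a compiled regular expression, or a list of tag names. A minimal sketch, using a trimmed copy of the sample HTML and html.parser so it runs without lxml; the variable names are illustrative:

```python
import re
from bs4 import BeautifulSoup

html_doc = """
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_two = soup.find_all("li", limit=2)             # stop after two matches
lists = soup.find_all(class_="list")                 # matches any tag whose class list contains "list"
by_regex = soup.find_all(id=re.compile(r"^list-"))   # attribute values can be matched by regex
ul_and_li = soup.find_all(["ul", "li"])              # several tag names at once

print([li.get_text() for li in first_two])
print([ul["id"] for ul in lists])
print([ul["id"] for ul in by_regex])
print(len(ul_and_li))
```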
text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
['Foo', 'Foo']
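In Beautiful Soup 4.4.0 and later the same filter is spelled string=, with text kept as the older name. A minimal sketch on trimmed, made-up markup, assuming a bs4 version that supports string=:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>", "html.parser")

exact = soup.find_all(string="Foo")               # returns the matching text nodes
pattern = soup.find_all(string=re.compile("a"))   # text nodes containing "a"
tags = soup.find_all("li", string="Bar")          # with a tag name, returns the tags

print(exact)
print(pattern)
print(tags)
```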
find(name,attrs,recursive,text,**kwargs)
find returns the first matching element; find_all returns all matching elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
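<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>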
print(type(soup.find('ul')))
<class 'bs4.element.Tag'>
print(type(soup.find('page')))
When no element matches, find returns None:
<class 'NoneType'>
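Because find() returns None for a missing tag, chaining an attribute access directly onto it raises AttributeError. A minimal sketch of a guard; `text_or_default` is a hypothetical helper, not a BeautifulSoup function:

```python
from bs4 import BeautifulSoup


def text_or_default(soup, name, default=""):
    """Return the stripped text of the first <name> tag, or default if absent."""
    tag = soup.find(name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


soup = BeautifulSoup("<div><h4>Helllo</h4></div>", "html.parser")
found = text_or_default(soup, "h4")
fallback = text_or_default(soup, "page", default="n/a")

print(found)
print(fallback)
```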
CSS selectors
Pass a CSS selector directly to select() to make the selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])
Output:
[<div class="panel-heading">
<h4>Helllo</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
Iterating over the results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
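select() always returns a list; `select_one()` returns the first match or None. Child combinators and pseudo-classes such as :nth-of-type() are also accepted. A minimal sketch on a trimmed copy of the sample HTML, parsed with html.parser to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_item = soup.select_one("#list-1 li")                                 # first match only
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]  # direct children
second_items = [li.get_text() for li in soup.select("li:nth-of-type(2)")]  # second li of each ul
nothing = soup.select_one("#missing")                                      # no match gives None

print(first_item.get_text())
print(small_items)
print(second_items)
print(nothing)
```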
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
Getting the content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar
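Called on a container tag, get_text() also returns the whitespace between child tags. The separator and strip arguments, or the stripped_strings generator, clean that up. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html_doc = "<ul>\n<li>Foo</li>\n<li>Bar</li>\n<li>Jay</li>\n</ul>"
soup = BeautifulSoup(html_doc, "html.parser")
ul = soup.ul

raw = ul.get_text()                              # keeps the newlines between <li> tags
joined = ul.get_text(separator=" ", strip=True)  # strips each piece, then joins with a space
pieces = list(ul.stripped_strings)               # the same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```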
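Summary:
- Prefer the lxml parser; fall back to html.parser when necessary
- Tag selection (soup.title, soup.p) has weak filtering ability but is fast
- Use find() to match a single result and find_all() to match multiple results
- If you are familiar with CSS selectors, use select()
- Remember the common ways to get attribute and text values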