Installation

pip3 install beautifulsoup4
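The examples in this article also use the lxml parser, and html5lib appears in the parser table below; both are separate packages, not part of beautifulsoup4 itself. If you want to follow along with those parsers, an optional extra step is:

pip3 install lxml
pip3 install html5lib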

Parsers

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python, moderate speed, good document fault tolerance | Poor document fault tolerance in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast, good document fault tolerance | Requires the lxml C library to be installed |
| lxml XML parser | BeautifulSoup(markup, 'xml') | Fast, the only parser that supports XML | Requires the lxml C library to be installed |
| html5lib | BeautifulSoup(markup, 'html5lib') | Best fault tolerance, parses documents the way a browser does, generates HTML5-formatted output | Slow, requires an external Python dependency |
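As a small illustration of how the parser name from the table is passed (the markup string here is made up for demonstration):

from bs4 import BeautifulSoup

markup = "<p>Hello<p>World"   # deliberately unclosed tags, for demonstration

print(BeautifulSoup(markup, 'html.parser').prettify())  # built-in parser, no extra install
print(BeautifulSoup(markup, 'lxml').prettify())         # needs the lxml package
print(BeautifulSoup(markup, 'html5lib').prettify())     # needs html5lib; parses like a browser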

Basic usage

html = """
<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())

prettify() prints the parsed document; the parser automatically completes the missing markup:

<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well
  </p>
  <p class="story">
   ...story go on...
  </p>
 </body>
</html>

print(soup.title.string)

This prints the document's title:

The Dormouse's story

Tag selectors

Selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

The output is:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first <p> tag is returned

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)

title

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Both ways return the value of the name attribute:

dormouse

dormouse
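One detail not shown above: for multi-valued attributes such as class, BeautifulSoup returns a list instead of a plain string. A small sketch against the same sample document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p['class'])   # ['title'] -- class is multi-valued, so a list comes back
print(soup.a['href'])    # http://example.com/elsie -- ordinary attributes are plain strings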

Getting text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)

The Dormouse's story
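.string works here because the <b> tag has exactly one child. When a tag has several children, such as <p class="story">, .string returns None; get_text() is the usual alternative. A quick sketch:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
story = soup.find(class_='story')
print(story.string)      # None: the tag contains text plus several <a> tags
print(story.get_text())  # concatenates all of the text inside the tag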

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)

The Dormouse's story

Child and descendant nodes

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)

contents returns the direct children as a list:

['Once upon a time there were three little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well\n ']

children is an iterator:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)

<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well

html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Descendant nodes (grandchildren and deeper) are included as well:

<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
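If you only care about the text of the descendants, the .strings and .stripped_strings generators walk the same tree and yield just the text nodes (stripped_strings removes the surrounding whitespace). A short sketch on the same soup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for i,text in enumerate(soup.p.stripped_strings):
    print(i,repr(text))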

Parent and ancestor nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)

Output:

<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>

Enumerating the parent iterates over its children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))

Output:

[(0, 'Once upon a time there were three little sisters;and their names were\n '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, '; and they lived at the bottom of a well\n ')]

print(list(enumerate(soup.a.parents)))

All ancestors are printed; the last one is the root of the document:

[(0, <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>)]

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))

The output is:

[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]

print(list(enumerate(soup.a.previous_siblings)))

[(0, 'Once upon a time there were three little sisters;and their names were\n ')]
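The singular next_sibling and previous_sibling return a single node. Keep in mind that text between tags also counts as a sibling, so these often return a string rather than the neighbouring tag. A small sketch:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(repr(soup.a.next_sibling))      # here the second <a> tag follows immediately
print(repr(soup.a.previous_sibling))  # the leading 'Once upon a time...' text node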
Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

Searches the document by tag name, attributes, or text content.

name
html = """
<div class="panel">
<div class="panel-heading">
<h4>Helllo</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

The output is:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
The output is:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs

html = '''
<div class="panel">\n <div class="panel-heading">\n <h4>Helllo</h4>\n </div>\n <div class="panel-body">\n <ul class="list" id="list-1" name=elements>\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n <li class="element">Jay</li>\n </ul>\n <ul class="list list-small" id="list-2">\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n </ul>\n </div>\n</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

Output:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

If you know the id or class, you can also search with keyword arguments (class_ ends with an underscore because class is a Python keyword):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

print(soup.find_all(class_='element'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))

['Foo', 'Foo']
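Searching by text returns the matching strings themselves rather than the tags containing them. In recent BeautifulSoup versions the same keyword is also spelled string= (text= remains as an alias), and stepping up to .parent gives the enclosing tags. A hedged sketch:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(string='Foo'))                      # ['Foo', 'Foo'], same as text='Foo'
print([s.parent for s in soup.find_all(string='Foo')])  # the <li> tags that contain 'Foo'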

find(name,attrs,recursive,text,**kwargs)

find() returns a single element (the first match); find_all() returns all matches.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))

Output:

<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

print(type(soup.find('ul')))

<class 'bs4.element.Tag'>

If the tag does not exist, find() returns None:

print(type(soup.find('page')))

<class 'NoneType'>
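The same search interface also exists along the other navigation directions, for example find_parent()/find_parents() and find_next_sibling()/find_next_siblings(). A small sketch on the same document:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
li = soup.find('li')
print(li.find_parent('ul')['id'])   # list-1: the nearest enclosing <ul>
print(li.find_next_sibling('li'))   # <li class="element">Bar</li>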

CSS selectors

Pass a CSS selector to select() to pick out elements directly.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])

The output is:

[<div class="panel-heading">
<h4>Helllo</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
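When only the first match is needed, select_one() (available in newer BeautifulSoup versions) is a shortcut for select(...)[0]:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select_one('ul li'))             # <li class="element">Foo</li>
print(soup.select_one('#list-2 .element'))  # first element inside #list-2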

Iterating over the results:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

The output is:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

The output is:

list-1
list-1
list-2
list-2
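Indexing like ul['id'] raises KeyError if the attribute is missing; tag.get() returns None instead, which is convenient when not every tag carries the attribute. A small sketch:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.get('id'))    # same value as ul['id']
    print(ul.get('name'))  # 'elements' for the first <ul>, None for the second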

Getting text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:

Foo
Bar
Jay
Foo
Bar

Summary:

• Prefer the lxml parser; fall back to html.parser when necessary
• Tag selection (soup.tag) is fast, but its filtering ability is weak
• Use find() and find_all() to match a single result or multiple results
• If you are familiar with CSS selectors, use select()
• Remember the common methods for getting attributes and text values (a short consolidated sketch follows)
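As a closing sketch that ties the recommendations together on the panel/list sample above (no assumptions beyond that markup): find_all() to locate the tags, attribute access for values, get_text() for text:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul', class_='list'):
    print(ul.get('id'), [li.get_text() for li in ul.find_all('li')])
# Expected output along the lines of:
# list-1 ['Foo', 'Bar', 'Jay']
# list-2 ['Foo', 'Bar']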
