Web Scraping: The BeautifulSoup4 Parser

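BeautifulSoup makes parsing HTML fairly simple and has a friendly API. It supports CSS selectors, and it can use either the HTML parser in the Python standard library or the lxml parsers, including lxml's XML parser.

Compared with regular expressions, it is easier to use.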
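As a quick illustration of the points above, here is a minimal sketch that uses the standard-library parser ("html.parser") and one CSS selector. The snippet string and variable names are made up for this example; "html.parser" is chosen only so that the code runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the examples in this article.
snippet = "<p class='intro'>Hello, <b>soup</b></p>"

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(snippet, "html.parser")

# Navigate by tag name.
first_bold = soup.b.string

# Select with a CSS selector and collect all text inside the match.
intro_text = soup.select_one("p.intro").get_text()

print(first_bold)
print(intro_text)
```

Swapping "html.parser" for "lxml" should give the same result for this snippet, provided lxml is installed.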

Example:

First, import the bs4 library:

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# Pretty-print the contents of the soup object

print(soup.prettify())

Output

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title" name="dromouse">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    <!-- Elsie -->

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

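The four kinds of objects

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: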

Tag
NavigableString
BeautifulSoup
Comment

1.Tag

Put simply, a Tag is one of the tags in the HTML. For example:

<head><title>The Dormouse's story</title></head>

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

The HTML tags above (title, head, a, p and so on), together with their contents, are Tags. Let's use BeautifulSoup to get some Tags:

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# Print the title tag

print(soup.title)

# Print the head tag

print(soup.head)

# Print the a tag

print(soup.a)

# Print the p tag

print(soup.p)

# Print the type of soup.p

print(type(soup.p))

Output

<title>The Dormouse's story</title>

<head><title>The Dormouse's story</title></head>

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<class 'bs4.element.Tag'>

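We can easily get these tags by writing soup followed by the tag name, and the resulting objects are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document. Querying for all matching tags is covered later.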
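The "first match only" behaviour can be checked with a small sketch. The two-link document below is invented for the illustration, and "html.parser" is used instead of lxml only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Illustrative document with two <a> tags.
html_doc = '<p><a id="link1">one</a><a id="link2">two</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access returns only the first <a> in the document.
first_link = soup.a

# find() returns that same first match.
same_link = soup.find("a")

# find_all() returns every match.
all_links = soup.find_all("a")

print(first_link["id"])
print(first_link is same_link)
print(len(all_links))
```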

A Tag has two important attributes: name and attrs.

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

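# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# The soup object itself is special: its name is [document]

print(soup.name)

# For any other tag, the value is the name of the tag itself

print(soup.head.name)

# Print all attributes of the p tag, as a dictionary

print(soup.p.attrs)

# Print the class attribute of the p tag

print(soup.p['class'])

# The get method also takes the attribute name and gives the same result here

print(soup.p.get('class'))

print(soup.p)

# Modify an attribute

soup.p['class'] = "newClass"

print(soup.p)

# Delete an attribute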

del soup.p['class']

print(soup.p)

Output

[document]

head

{'class': ['title'], 'name': 'dromouse'}

['title']

['title']

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>

<p name="dromouse"><b>The Dormouse's story</b></p>
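The output above shows class coming back as a list while name is a plain string. The sketch below illustrates that difference, plus how indexing and get() differ for a missing attribute. The markup is invented for the example, and "html.parser" stands in for lxml so the code runs without extra installs.

```python
from bs4 import BeautifulSoup

# Illustrative tag with a two-valued class attribute.
soup = BeautifulSoup(
    '<a class="sister big" id="link1" href="/elsie">Elsie</a>', "html.parser"
)
link = soup.a

# class may hold several values in HTML, so it is returned as a list.
classes = link["class"]

# Single-valued attributes such as id are returned as strings.
link_id = link["id"]

# get() returns None for a missing attribute; link["title"] would raise KeyError.
missing = link.get("title")

print(classes, link_id, missing)
```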

2.NavigableString

Now that we have the tag, how do we get the text inside it? Simple: use .string. For example:

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# Print the text inside the p tag

print(soup.p.string)

# Print the type of soup.p.string

print(type(soup.p.string))

Output

The Dormouse's story

<class 'bs4.element.NavigableString'>

3.BeautifulSoup

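A BeautifulSoup object represents the content of a document as a whole. Most of the time you can treat it as a special kind of Tag, so we can look at the type of its name, its name, and its attributes: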

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# Type of the name attribute

print(type(soup.name))

# Name

print(soup.name)

# Attributes

print(soup.attrs)

Output

<class 'str'>

[document]

{}

4.Comment

A Comment object is a special type of NavigableString. When it is output, the comment markers are not included.

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.a)

print(soup.a.string)

print(type(soup.a.string))

Output

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

 Elsie

<class 'bs4.element.Comment'>

The content of the a tag is actually a comment, but when we output it with .string, the comment markers have already been removed.
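Because .string strips the comment markers, a comment can be mistaken for ordinary text. One way to guard against that is to check the type before using the value. This is a minimal sketch with an invented two-link document; "html.parser" is used here instead of lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative document: the first link holds only a comment.
soup = BeautifulSoup("<a><!-- Elsie --></a><a>Lacie</a>", "html.parser")

visible = []
for link in soup.find_all("a"):
    content = link.string
    # Skip values that are really HTML comments.
    if isinstance(content, Comment):
        continue
    visible.append(str(content))

print(visible)
```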

Traversing the document tree

1.Direct children: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the Tag's child nodes as a list

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# The result is a list

print(soup.head.contents)

print(soup.head.contents[0])

Output

[<title>The Dormouse's story</title>]

<title>The Dormouse's story</title>

.children

It does not return a list, but we can iterate over it to get all the child nodes.

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# The result is an iterator object, not a list

print(soup.head.children)

# Iterate to get all the child nodes

for child in soup.head.children:

    print(child)

Output

<list_iterator object at 0x008FF950>

<title>The Dormouse's story</title>

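2.All descendants: the .descendants attribute

The .contents and .children attributes described above only include a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants. As with .children, we need to iterate over it to get the contents.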

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# The result is a generator object

print(soup.head.descendants)

# Iterate to get all the descendant nodes

for child in soup.head.descendants:

    print(child)

Output

<generator object descendants at 0x00519AB0>

<title>The Dormouse's story</title>

The Dormouse's story
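To make the difference between direct children and descendants concrete, here is a small sketch on an invented nested document. It assumes "html.parser" rather than lxml so that no wrapper html or body tags are added around the fragment.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative fragment: a div with two paragraphs, one containing a <b>.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# .children yields only the two <p> tags directly under the div.
direct = [child.name for child in div.children]

# .descendants walks the whole subtree depth-first, strings included.
everything = [
    node.name if isinstance(node, Tag) else str(node)
    for node in div.descendants
]

print(direct)
print(everything)
```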

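3.Node content: the .string attribute

If a Tag has only one child and that child is a NavigableString, .string returns that child. If a Tag has only one child and that child is another Tag, .string can also be used, and the result is the same as the .string of that only child.

Put simply: if a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one other tag, .string also returns the text inside. For example: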

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.head.string)

print(soup.head.title.string)

Output

The Dormouse's story

The Dormouse's story
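The rule above leaves one case open: a tag with more than one child. In that case .string returns None, and get_text() is one way to collect all the text instead. A minimal sketch on an invented fragment, using "html.parser" to avoid depending on lxml:

```python
from bs4 import BeautifulSoup

# Illustrative fragment: <p> has two children, a string and a <b> tag.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

# Two children, so .string cannot pick one and returns None.
ambiguous = soup.p.string

# <b> has a single string child, so .string works.
single = soup.b.string

# get_text() concatenates all strings under the tag.
full_text = soup.p.get_text()

print(ambiguous, single, full_text)
```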

Searching the document tree

1.find_all(name, attrs, recursive, text, **kwargs)

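1)name parameter

The name parameter finds all Tags whose name is name. String objects are automatically ignored.

A.Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup finds all tags whose name matches that string exactly, returning them as a list.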

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.find_all("b"))

print(soup.find_all("a"))

Output

[<b>The Dormouse's story</b>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

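B.Passing a regular expression

If you pass a regular expression, Beautiful Soup matches tag names against it using the regular expression's search() method, so a pattern without the ^ anchor can match anywhere in the name.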

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

import re

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

Output

body

b
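The effect of the ^ anchor can be seen by comparing an anchored and an unanchored pattern. This is a sketch on an invented minimal document, with "html.parser" standing in for lxml.

```python
import re

from bs4 import BeautifulSoup

# Illustrative minimal document.
soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><p>x</p></body></html>",
    "html.parser",
)

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains "t" somewhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)
print(unanchored)
```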

C.Passing a list

If you pass a list, Beautiful Soup returns everything that matches any element of the list, as a list.

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

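# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.find_all(['a', 'b']))

Output

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]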

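2)keyword arguments

A keyword argument whose name is not one of the built-in parameter names is treated as a filter on the tag attribute with that name.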

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.find_all(id="link1"))

Output

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
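Keyword filters are not limited to id. The sketch below shows three common variations: class_ (with a trailing underscore, because class is a Python keyword), a regular expression as the attribute value, and the attrs dictionary for attribute names that are not valid Python identifiers. The markup, including the data-role attribute, is invented for the example, and "html.parser" replaces lxml to avoid an extra install.

```python
import re

from bs4 import BeautifulSoup

# Illustrative markup.
html_doc = (
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '<p data-role="note">n</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# class is reserved in Python, so the keyword is spelled class_.
by_class = soup.find_all("a", class_="sister")

# Attribute values can be filtered with a regular expression.
by_href = soup.find_all(href=re.compile("lacie"))

# Names such as data-role cannot be keywords; use the attrs dict instead.
by_data = soup.find_all(attrs={"data-role": "note"})

print(len(by_class), by_href[0]["id"], by_data[0].name)
```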

3)text parameter

The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list.

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

import re

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

# String

print(soup.find_all(text = " Elsie "))

# List

print(soup.find_all(text = ["Tillie", " Elsie ", "Lacie"]))

# Regular expression

print(soup.find_all(text = re.compile("Dormouse")))

Output

[' Elsie ']

[' Elsie ', 'Lacie', 'Tillie']

["The Dormouse's story", "The Dormouse's story"]
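In Beautiful Soup 4.4.0 and later the same filter is also available under the name string, with text kept as the older spelling. The sketch below assumes such a version is installed. The three-paragraph document is invented for the example, and "html.parser" is used in place of lxml.

```python
import re

from bs4 import BeautifulSoup

# Illustrative document.
soup = BeautifulSoup("<p>Lacie</p><p>Tillie</p><p>Elsie</p>", "html.parser")

# Exact string match returns the matching strings, not the tags.
exact = soup.find_all(string="Lacie")

# A list matches any of its elements.
either = soup.find_all(string=["Tillie", "Elsie"])

# A regular expression is applied to each string.
pattern = soup.find_all(string=re.compile("ie$"))

# Combined with a tag name, the result is the tags whose .string matches.
tags = soup.find_all("p", string="Lacie")

print(exact, either, pattern, tags)
```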

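CSS selectors

This is another way of searching that achieves much the same result as find_all().

In CSS, a tag name is written without any prefix, a class name is prefixed with ., and an id is prefixed with #.
We can filter elements in a similar way here using soup.select(), which returns a list.

(1) Finding by tag name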

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select("title"))

print(soup.select("b"))

print(soup.select("a"))

Output

[<title>The Dormouse's story</title>]

[<b>The Dormouse's story</b>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(2) Finding by class name

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select(".title"))

Output

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

(3) Finding by id

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select("#link1"))

Output

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Combined selectors

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select("p #link1"))

Output

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
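A few more selector forms may be useful alongside the descendant selector shown above: the child combinator >, a compound selector on a single node, and select_one(), which returns the first match or None instead of a list. This is a sketch on invented markup, with "html.parser" used in place of lxml.

```python
from bs4 import BeautifulSoup

# Illustrative markup: two links inside a <p>, one inside a <div>.
html_doc = (
    '<p class="story"><a class="sister" id="link1">Elsie</a> '
    '<a class="sister" id="link2">Lacie</a></p>'
    '<div><a id="link3">Tillie</a></div>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator: <a> tags that are direct children of a <p>.
children = soup.select("p > a")

# Compound selector: tag, class and id all refer to the same node, so no spaces.
combined = soup.select("a.sister#link1")

# select_one() returns a single Tag, or None when nothing matches.
one = soup.select_one("#link2")
none = soup.select_one("#nope")

print([a["id"] for a in children], one.get_text(), none)
```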

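(5) Finding by attribute

A selector can also include attributes, which must be enclosed in square brackets. Note that the attribute and the tag name refer to the same node, so there must be no space between them. Otherwise the selector will not match.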

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select("a[class='sister']"))

Output

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

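Attribute selectors can also be combined with the selection methods above. Parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.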

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))

Output

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(6) Getting the content

The select() calls above all return lists. You can iterate over the result and call get_text() on each element to get its content.

#!/usr/bin/python3

# -*- coding:utf-8 -*-

__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Create a BeautifulSoup object, specifying the lxml parser

soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))

for item in soup.select("p a[class='sister']"):

    print(item.get_text())

Output

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Lacie

Tillie

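Note: the first link contains only the comment <!-- Elsie -->, so get_text() returns an empty string for it and no text is output.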
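In practice the link target is often wanted together with the text. The sketch below collects (href, text) pairs from a select() result and shows that the comment-only link yields an empty string. The shortened markup is invented for the example, "html.parser" is used instead of lxml, and strip=True is passed to get_text() to trim surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first link holds only a comment.
html_doc = (
    '<p><a href="http://example.com/elsie"><!-- Elsie --></a>'
    '<a href="http://example.com/lacie"> Lacie </a></p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's href attribute with its visible text.
rows = [(a["href"], a.get_text(strip=True)) for a in soup.select("p a")]

for href, text in rows:
    print(href, repr(text))
```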