Using Beautiful Soup (Part 5): select

Original article: http://www.bugingcode.com/blog/beautiful_soup_select.html

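Like find and find_all, select picks out specific tags. Its matching rules come from CSS, which is why it is called a CSS selector. If you have used jQuery before, you will notice that select's rules look a lot like jQuery's selectors.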

Selecting by tag name

To filter by tag name, write the name with no prefix or other decoration, as follows:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""

# Parse the document with the lxml parser (requires the lxml package)
soup = BeautifulSoup(html, "lxml")

# Select every <p> tag
print(soup.select('p'))

The result is:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>]

The result shows that select returns a list. What are the elements of that list?

print(type(soup.select('p')[0]))

The result is:

<class 'bs4.element.Tag'>

Each element is a bs4.element.Tag, the same as with find_all. select('p') returns every tag whose name is p.
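To make the comparison with find_all concrete, here is a minimal sketch. It assumes a shortened version of the sample HTML and uses the built-in "html.parser" instead of "lxml" so that it runs without extra packages; it also shows select_one, which the article does not cover.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample document (an assumption made for this sketch)
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
"""

# "html.parser" ships with Python; the article itself uses "lxml"
soup = BeautifulSoup(html, "html.parser")

selected = soup.select("p")   # CSS selector
found = soup.find_all("p")    # the same query through find_all

# Both calls yield Tag objects for the same elements, in document order
all_tags = all(isinstance(tag, Tag) for tag in selected)
same_elements = [str(t) for t in selected] == [str(t) for t in found]
print(len(selected), all_tags, same_elements)

# select_one returns only the first match, or None when nothing matches
first = soup.select_one("p")
print(first["class"])
print(soup.select_one("table"))
```

Use select_one when only the first match matters; it avoids indexing into a list that may be empty.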

Selecting by class name and id

When filtering, prefix a class name with a dot (.) and an id with #:

print(soup.select('.title'))

print(soup.select('#link2'))

The results are:

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
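Because select always returns a list, it helps to see the three possible outcomes: several matches, one match, and none. The sketch below assumes a trimmed copy of the sample HTML and the built-in "html.parser"; the id link99 is a made-up value that matches nothing.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (an assumption made for this sketch)
html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A class can be shared, so a class selector may return several tags
sisters = soup.select(".sister")
sister_names = [a.get_text() for a in sisters]

# An id should be unique; take the first hit after checking the list is not empty
hits = soup.select("#link3")
link3_text = hits[0].get_text() if hits else None

# "link99" is a hypothetical id that does not exist, so the list is empty
missing = soup.select("#link99")

print(sister_names, link3_text, len(missing))
```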

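Selecting by attribute

What if the attribute is neither an id nor a class: can it still be used for filtering? It can. Write the attribute name and its value inside square brackets: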

print(soup.select('[href="http://example.com/lacie"]'))

This selects the tag whose href is http://example.com/lacie.
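Exact-value matching is only one form of attribute selector. The following sketch shows a few other standard CSS forms that select accepts (presence, prefix, suffix, substring). It assumes a trimmed copy of the sample HTML and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (an assumption made for this sketch)
html = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Return the id of every tag in the list (None when a tag has no id)."""
    return [tag.get("id") for tag in tags]

has_href = ids(soup.select("a[href]"))                          # attribute is present
prefix = ids(soup.select('a[href^="http://example.com/"]'))     # value starts with
suffix = ids(soup.select('a[href$="tillie"]'))                  # value ends with
contains = ids(soup.select('a[href*="lac"]'))                   # value contains
by_name = soup.select('p[name="dromouse"]')                     # any attribute works, not just href

print(has_href, prefix, suffix, contains, len(by_name))
```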

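Combined selectors

Combined selection comes in two forms: applying two conditions to a single tag, and searching down the tree level by level.

The first form looks like this: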

print(soup.select('a#link2'))

This selects the tag whose name is a and whose id is link2.

The output is:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
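Conditions written without a space between them must all hold for the same tag. As a small sketch of that rule, assuming a trimmed copy of the sample HTML and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (an assumption made for this sketch)
html = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name + class: only <p> tags that carry the class "story"
story_paragraphs = soup.select("p.story")

# Tag name + class + id: all three conditions apply to one tag
lacie = soup.select("a.sister#link2")

# The id link2 belongs to an <a>, so asking for a <p> with that id finds nothing
wrong_tag = soup.select("p#link2")

print(len(story_paragraphs), lacie[0].get_text(), len(wrong_tag))
```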

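The other form looks like this:

Starting from body, find every p inside it, then within those p tags find the tag named a with id link2. This kind of level-by-level search down the tree is very common when analyzing HTML structure. Levels are separated by spaces; a space matches descendants at any depth, not only direct children.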

print(soup.select('body p a#link2'))

The result is:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
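The space between levels is the CSS descendant combinator. CSS also has a child combinator, >, which the article does not cover; the sketch below contrasts the two. It assumes the sample HTML and the built-in "html.parser" in place of "lxml".

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Space: any <a> somewhere below <body>, however deep
descendants = soup.select("body a")

# ">": only <a> tags that are direct children of <body>; here there are none,
# because every <a> sits inside a <p>
direct_children = soup.select("body > a")

# Chaining ">" spells out each level of the path explicitly
via_p = soup.select("body > p.story > a#link2")

print(len(descendants), len(direct_children), [a["id"] for a in via_p])
```

The descendant form is more tolerant of wrapper elements being added to a page; the child form is stricter and is useful when the same tag appears at several depths.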

When reposting, please credit the source: http://www.bugingcode.com/
