吴裕雄--天生自然python学习笔记：beautifulsoup库的使用

Beautiful Soup 库简介

Beautiful Soup提供一些简单的、python式的函数用来处理导航、搜索、修改分析树等功能。它是一个工具箱，通过解析文档为用户提供需要抓取的数据，因为简单，所以不需要多少代码就可以写出一个完整的应用程序。

Beautiful Soup自动将输入文档转换为Unicode编码，输出文档转换为utf-8编码。你不需要考虑编码方式，除非文档没有指定一个编码方式，这时，Beautiful Soup就不能自动识别编码方式了。然后，你仅仅需要说明一下原始编码方式就可以了。

Beautiful Soup已成为和lxml、html6lib一样出色的python解释器，为用户灵活地提供不同的解析策略或强劲的速度。

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.title)#选择title标签中的内容

print(type(soup.title))#title标签的类型

print(soup.head)#head标签中的内容

print(soup.p)#第一个p标签中的内容

<title>The Dormouse's story</title>

<class 'bs4.element.Tag'>

<head><title>The Dormouse's story</title></head>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

soup.标签名可以选择所需要的标签中的内容，返回结果是Tag类型。它查找的是在所有内容中的第一个符合要求的标签，例如上面的

的内容。

如果要查询所有的标签，在后面进行介绍。

获取名称

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.title.name)

title

获取属性

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print (soup.p.attrs['name'])

print(soup.p['name'])

dromouse

dromouse

获取内容

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print (soup.p.string)

The Dormouse's story

嵌套选择

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print (soup.head.title.string)

The Dormouse's story

获取子节点

.contents方法：

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.p.contents)

[<b>The Dormouse's story</b>]

.content 属性可以将tag的子节点以列表的方式输出，输出方式为列表，我们可以用列表索引来获取它的某一个元素，如：

print(soup.head.contents[0])

#返回结果：<title>The Dormouse's story</title>

.children方法：

print(soup.head.children)

#返回结果：<listiterator object at 0x7f71457f5710>

他返回一个迭代器对象，可以通过遍历获得其中的子节点，如：

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.head.children)

for child in  soup.body.children:

    print (child)

<list_iterator object at 0x0000019BAF29BE48>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

获取子孙节点

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for child in soup.body.descendants:

    print (child)

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<b>The Dormouse's story</b>

The Dormouse's story

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

 Elsie

,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

Lacie

 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Tillie

;

and they lived at the bottom of a well.

<p class="story">...</p>

...

以上代码可以获取到head标签的所有子孙节点中的内容。

.descendants返回的是一个迭代器的类型，可以通过遍历输出所有的结果。

获取多个内容

.strings 获取多个内容，不过需要遍历获取，比如下面的例子

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for string in soup.strings:

    print(repr(string))

"The Dormouse's story"

'\n'

'\n'

"The Dormouse's story"

'\n'

'Once upon a time there were three little sisters; and their names were\n'

',\n'

'Lacie'

' and\n'

'Tillie'

';\nand they lived at the bottom of a well.'

'\n'

'...'

'\n'

发现将所有标签的内容返回了出来，但是也有许多空的标签，也会进行返回，如果想过滤空行，可以使用.stripped_strings 函数。

.stripped_strings 输出的字符串中可能包含了很多空格或空行,使用 .stripped_strings 可以去除多余空白内容

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for string in soup.stripped_strings:

    print(repr(string))

"The Dormouse's story"

"The Dormouse's story"

'Once upon a time there were three little sisters; and their names were'

','

'Lacie'

'and'

'Tillie'

';\nand they lived at the bottom of a well.'

'...'

获取父节点

.parent 属性 获取此标签的上一层标签

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

p = soup.p

print (p.parent.name)

body

获取祖先节点

.parents 属性

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

content = soup.head.title.string

for parent in  content.parents:

    print (parent.name)

title

head

html

[document]

获取兄弟节点

.next_sibling 获取此标签后面的节点

.previous_sibling 获取此标签前面的节点

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print (soup.p.next_sibling)

#       实际该处为空白

print (soup.p.prev_sibling)

#None   没有前一个兄弟节点，返回 None

print (soup.p.next_sibling.next_sibling)

None

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

兄弟节点可以理解为和本节点处在统一级的节点，.next_sibling 属性获取了该节点的下一个兄弟节点，.previous_sibling 则与之相反，如果节点不存在，则返回 None

注意：实际文档中的tag的 .next_sibling 和 .previous_sibling 属性通常是字符串或空白，因为空白或者换行也可以被视作一个节点，所以得到的结果可能是空白或者换行

获取前后节点

.next_element 获取此标签后一个节点

.previous_element 获取此标签前一个节点

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print (soup.head.next_element)

<title>The Dormouse's story</title>

与 .next_sibling .previous_sibling 不同，它并不是针对于兄弟节点，而是在所有节点，不分层次。

比如 head 节点为：

<head><title>The Dormouse's story</title></head>

那么它的下一个节点便是 title，它是不分层次关系的

获取所有前后节点

.next_elements 获取后面的所有节点

.previous_elements获取前面的所有节点

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for element in soup.body.next_elements:

    print(repr(element))

'\n'

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<b>The Dormouse's story</b>

"The Dormouse's story"

'\n'

<p class="story">Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

'Once upon a time there were three little sisters; and their names were\n'

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

' Elsie '

',\n'

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

'Lacie'

' and\n'

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

'Tillie'

';\nand they lived at the bottom of a well.'

'\n'

<p class="story">...</p>

'...'

'\n'

通过 .next_elements 和 .previous_elements 的迭代器就可以向前或向后访问文档的解析内容,就好像文档正在被解析一样

标准选择器

find_all( name , attrs , recursive , text , **kwargs )

可以根据标签名，属性，内容查找文档。

name 参数

可以传入一个标签名，会返回该标签名下的内容。

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.find_all('a'))

print(type(soup.find_all('p')))

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<class 'bs4.element.ResultSet'>

也可以通过层层嵌套的方法获取输出：

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for p in soup.find_all('p'):

    print(p.find_all('a'))

attrs参数

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.find_all(attrs={'name':'dromouse'}))

#也可以使用以下方法，输出结果相同

print(soup.find_all(class_='title'}))

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]

text属性

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.find_all(text='Lacie'))

['Lacie']
返回的是它里面的内容，但是在做元素查找时没有什么用处，但是做元素匹配时会有用处。

limit 参数

find_all() 方法返回全部的搜索结构,如果文档树很大那么搜索会很慢.如果我们不需要全部结果,可以使用 limit 参数限制返回结果的数量.效果与SQL中的limit关键字类似,当搜索到的结果数量达到 limit 的限制时,就停止搜索返回结果.

文档树中有3个tag符合搜索条件,但结果只返回了2个,因为我们限制了返回数量。

soup.find_all("a", limit=2)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

find( name , attrs , recursive , text , **kwargs )

它与 find_all() 方法唯一的区别是 find_all() 方法的返回结果是值包含一个元素的列表,而 find() 方法直接返回结果，也就是说，find方法只会返回一个元素。

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.find('p'))

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

其他用法一样的标准选择器

（1）find_parents() find_parent()

find_all() 和 find() 只搜索当前节点的所有子节点,孙子节点等. find_parents() 和 find_parent() 用来搜索当前节点的父辈节点,搜索方法与普通tag的搜索方法相同,搜索文档搜索文档包含的内容

（2）find_next_siblings() find_next_sibling()

这2个方法通过 .next_siblings 属性对当 tag 的所有后面解析的兄弟 tag 节点进行迭代, find_next_siblings() 方法返回所有符合条件的后面的兄弟节点,find_next_sibling() 只返回符合条件的后面的第一个tag节点

（3）find_previous_siblings() find_previous_sibling()

这2个方法通过 .previous_siblings 属性对当前 tag 的前面解析的兄弟 tag 节点进行迭代, find_previous_siblings() 方法返回所有符合条件的前面的兄弟节点, find_previous_sibling() 方法返回第一个符合条件的前面的兄弟节点

（4）find_all_next() find_next()

这2个方法通过 .next_elements 属性对当前 tag 的之后的 tag 和字符串进行迭代, find_all_next() 方法返回所有符合条件的节点, find_next() 方法返回第一个符合条件的节点

（5）find_all_previous() 和 find_previous()

这2个方法通过 .previous_elements 属性对当前节点前面的 tag 和字符串进行迭代, find_all_previous() 方法返回所有符合条件的节点, find_previous()方法返回第一个符合条件的节点

CSS选择器

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

print(soup.select('.story .sister'))

print(soup.select('p a'))

print(soup.select('#link2'))

print(soup.select('p')[0])

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

第一个打印的结果为class属性为‘story’的标签中class属性为‘sister’的元素，使用class进行选择时，名称前需要加一个"."。

第二个打印结果为p标签中a标签的内容，与第一个返回结果相同。

第三个打印结果为id为link2标签的内容，在使用id标签进行选择时需要在名字前面加"#"符号。

第四个打印的是第一个p标签的内容。

进行层层迭代的输出

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for p in soup.select('p'):

    print(p.select('a'))

[]

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[]

获取属性

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for p in soup.select('p'):

    print(p['class'])

['title']

['story']

['story']

获取内容

from bs4 import BeautifulSoup

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

soup = BeautifulSoup(html,'lxml')

for p in soup.select('a'):

    print(p.get_text)

<bound method Tag.get_text of <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>>

<bound method Tag.get_text of <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>>

<bound method Tag.get_text of <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>>
获取到了p标签中a标签的所有内容

吴裕雄--天生自然python学习笔记：beautifulsoup库的使用的更多相关文章

吴裕雄--天生自然python学习笔记：Beautiful Soup 4.2.0模块
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时 ...
吴裕雄--天生自然python学习笔记：pandas模块强大的数据处理套件
用 Python 进行数据分析处理,其中最炫酷的就属 Pa ndas 套件了 . 比如,如果我们通过 Requests 及 Beautifulsoup 来抓取网页中的表格数据 , 需要进行较复杂的 ...
吴裕雄--天生自然python学习笔记：python 用pyInstaller模块打包文件
要想在没有安装 Python 集成环境的电脑上运行开发的 Python 程序,必须把 Python 文件打包成 .exe 格式的可执行文件. Python 的打包工作 PyInstaller 提供了 ...
吴裕雄--天生自然python学习笔记：python 文件批量查找
在多个文本文件中查找我们首先来学习文本文件的查找字符 . 我们通过 os.walk 扩大查找范围, 查找指定目录和子目录下的文件. 应用程序总览读取当前目录及子目录下的所有 PY 和 txt ...
吴裕雄--天生自然python学习笔记：python爬虫与网页分析
我们所抓取的网页源代码一般都是 HTML 格式的文件,只要研究明白 HTML 中的标签( Tag )结构,就很容易进行解析并取得所需数据 . HTML 网页结构 HTML 网页是由许多标签( Ta ...
吴裕雄--天生自然python学习笔记：WEB数据抓取与分析
Web 数据抓取技术具有非常巨大的应用需求及价值, 用 Python 在网页上收集数据,不仅抓取数据的操作简单, 而且其数据分析功能也十分强大. 通过 Python 的时lib 组件中的 urlpar ...
吴裕雄--天生自然python学习笔记：python通过“任务计划程序”实现定时自动下载或更新运行 PM2.5 数据抓取程序数据
在 Windows 任务计划程序中,设置每隔 30 分钟自动抓取 PM2.5 数据,井保存在 SQLite 数据库中 . import sqlite3,ast,requests,os from bs ...
吴裕雄--天生自然python学习笔记：python 用pygame模块动画一让图片动起来
动画是游戏开发中不可或缺的要素,游戏中的角色只有动起来才会拥有“生命”, 但动画处理也是最让游戏开发者头痛的部分.Pygame 包通过不断重新绘制绘图窗口,短短几行代码就可以让图片动起来! 动画处理程 ...
吴裕雄--天生自然python学习笔记：python 用pygame模块游戏开发
游戏开发在软件开发领域占据了非常重要的位直.游戏开发需要用到的技术相当广泛,除了多媒体.图片.动画的处理外,程序设计更是游戏开发的核心内容. Py game 是为了让 Python 能够进行游戏开 ...

随机推荐

Redis5新特性Streams作消息队列
前言 Redis 5 新特性中,Streams 数据结构的引入,可以说它是在本次迭代中最大特性.它使本次 5.x 版本迭代中,Redis 作为消息队列使用时,得到更完善,更强大的原生支持,其中尤为明显 ...
题解 P6005 【[USACO20JAN]Time is Mooney G】
抢第一篇题解这题的思路其实就是一个非常简单的dijkstra,如果跑到第一个点的数据不能更新的时候就输出很多人不知道要跑多少次才停.其实这题因为答案要减去 T*c^2,而每条边的值 <= 1 ...
Power BI 企业邮箱账户注册
Power BI 是免费的.但是一些功能需要企业账户才可以实现. 比如在线服务,移动端的服务,图标的市场,都需要注册账户实现. 1. 临时企业邮箱百度,或谷歌一个临时邮箱.这些邮箱大多都是有使用期限 ...
ccf-csp 任务调度，回溯算法我觉得ok神**wa了
#include<iostream> #include<string.h> #include<cmath> #define M 41 #define min(a,b ...
ZJNU 2212 - Turn-based game
Mr.Lee每隔1/x s攻击一次,cpu每隔1/y s攻击一次因为时间与答案无关,最后只看boss受到了多少次攻击所以可以在每个人的频率上同时乘以xy 即Mr.Lee每隔y s攻击一次,cpu每 ...
Myeclipse 10/2014 配置插件(svn、maven、properties、velocity)的方法
一.配置SVN详细图解什么是SVN? 管理软件开发过程中的版本控制工具. 下面会以两种方式来介绍怎么安装svn,myeclipse安装SVN插件步骤,以myeclipse 2014为例,第一种是最常 ...
redis day03 下
事务能够有回退状态事务命令安命令执行没问题,redis是弱事务型 nulti incr n1 -->QUEUED(返回仅队列了) EXEC -->返回结果 pipeline 流水 ...
spi设备描述过程
一.spi通信中控制器驱动及spi设备.spi设备驱动的关系入下图: 控制器驱动以及设备全志已经完成,在/driver/spi/spi--sunxi.c 中,打开源码文件可以看到spi控制器属于平 ...
python程序的打开运行方式
python程序的运行方式大致可以分为两种,一种是直接通过python解释器直接解释型运行,另外一种是先把python程序编译为二进制文件再运行. .源代码 -python源代码的文件以"p ...
研究NLP100篇必读的论文---已整理可直接下载
100篇必读的NLP论文 100 Must-Read NLP 自己汇总的论文集,已更新链接:https://pan.baidu.com/s/16k2s2HYfrKHLBS5lxZIkuw 提取码:x ...

吴裕雄--天生自然python学习笔记：beautifulsoup库的使用

吴裕雄--天生自然python学习笔记：beautifulsoup库的使用的更多相关文章

随机推荐

热门专题