Web Scraping, Part 5: BeautifulSoup

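BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient, works with several underlying parsers, and lets you extract information from a page without writing regular expressions.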

I. Parsers supported by BeautifulSoup

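Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses the page the way a browser does; produces valid HTML5; pure Python, no C extension needed	Very slow; must be installed separately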
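The difference between the parsers is easiest to see on broken markup. The sketch below feeds one invalid fragment to each parser that happens to be installed. The fragment and the helper name parse_with are made up for illustration, and lxml and html5lib are simply skipped when they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # illustrative fragment: a stray closing tag


def parse_with(parser):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN, parser))
    except FeatureNotFound:
        return None


# Hypothetical comparison loop: one entry per parser name from the table above.
results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, markup in results.items():
    print(name, "->", markup if markup is not None else "(not installed)")
```

Each parser repairs the fragment differently, so it is worth naming the parser explicitly rather than relying on whichever one BeautifulSoup picks by default.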

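Installing lxml

pip3 install wheel

Download the lxml .whl file that matches your Python version and platform from https://pypi.org/project/lxml/

Change to the directory that contains the downloaded .whl file, then install it:

pip3 install lxml-4.2.3-cp35-cp35m-win_amd64.whl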

II. Basic usage

Example 1: the HTML below is incomplete (the body and html tags are never closed).

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Parse the document and extract information

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.prettify())  # pretty-print the document; missing closing tags are filled in automatically

print(soup.title.string)  # print the text inside the title tag

Output:

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title" name="dromouse">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    <!-- Elsie -->

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

The Dormouse's story

III. Tag selectors

Tag selectors are fast, but their filtering is too limited to cover every HTML parsing need.

1. Selecting elements

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Extract information

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.title)

print(type(soup.title))

print(soup.head)

print(soup.p)  # only the first matching tag is returned

Output:

<title>The Dormouse's story</title>

<class 'bs4.element.Tag'>

<head><title>The Dormouse's story</title></head>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
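Two consequences of "first match only" are worth spelling out: the remaining matches need find_all(), and a tag that does not exist comes back as None. The sketch below uses a made-up two-paragraph document and html.parser, so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Illustrative document, not the one used in the article.
doc = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p                # only the first <p>
all_p = soup.find_all("p")      # every <p>
missing = soup.table            # there is no <table>, so this is None

print(first_p.string)
print([p.string for p in all_p])
print(missing)

# Chaining through a missing tag would raise AttributeError, so check for None first.
caption = missing.caption.string if missing is not None else None
print(caption)
```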

2. Getting the tag name

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Get the name of the tag

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.title.name)

Output:

title

3. Getting attributes

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

# Get the value of the name attribute of the first p tag; two equivalent ways are shown

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.p.attrs['name'])

print(soup.p['name'])

Output:

dromouse

dromouse
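Both forms above raise KeyError when the attribute is absent. A minimal sketch of the more forgiving Tag.get(), plus the fact that class comes back as a list because it is a multi-valued attribute (the markup is a made-up one-liner):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

name_attr = p["name"]             # raises KeyError if the attribute is absent
missing = p.get("id")             # returns None instead of raising
fallback = p.get("id", "no-id")   # or a default of your choice
classes = p["class"]              # class is multi-valued, so this is a list

print(name_attr, missing, fallback, classes)
```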

4. Getting the content

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.p.string)

Output:

The Dormouse's story
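.string works here because the p tag has a single child chain (p, then b, then the text). The sketch below, on made-up markup, shows the two cases where it is less obvious: a tag with several children, where .string is None, and a tag whose only child is a comment.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    '<p id="one">Hello <b>world</b></p><a id="two"><!-- Elsie --></a>',
    "html.parser",
)
p = soup.find(id="one")
a = soup.find(id="two")

p_string = p.string      # None: <p> has two children, so .string is ambiguous
p_text = p.get_text()    # concatenates every string below <p>
a_string = a.string      # the comment body, as a Comment object
a_text = a.get_text()    # comments are left out of get_text()

print(repr(p_string), repr(p_text), repr(a_string), repr(a_text))
```

When the text may be split across nested tags, get_text() is usually the safer choice.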

5. Nested selection

html = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.head.title.string)  # chain attribute access with . to drill down level by level

Output:

The Dormouse's story

6. Children and descendants

6.1: Two ways to get the child nodes

html = """

<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

    <body>

        <p class="story">

            Once upon a time there were three little sisters; and their names were

            <a href="http://example.com/elsie" class="sister" id="link1">

                <span>Elsie</span>

            </a>

            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

            and

            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

            and they lived at the bottom of a well.

        </p>

        <p class="story">...</p>

"""

Method 1: .contents returns the direct children as a list

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.p.contents)

Output:

['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']

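Method 2: .children also gives the direct children, but as an iterator. enumerate() is used below to print each child together with its index.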

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.p.children)

for i, child in enumerate(soup.p.children):

    print(i, child)

Output:

<list_iterator object at 0x1064f7dd8>

0

Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

4

and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

6

and they lived at the bottom of a well.

6.2: .descendants returns all descendant nodes (children, grandchildren, and so on). It is also an iterator (a generator).

html = """

<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

    <body>

        <p class="story">

            Once upon a time there were three little sisters; and their names were

            <a href="http://example.com/elsie" class="sister" id="link1">

                <span>Elsie</span>

            </a>

            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

            and

            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

            and they lived at the bottom of a well.

        </p>

        <p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.p.descendants)

for i, child in enumerate(soup.p.descendants):

    print(i, child)

Output:

<generator object descendants at 0x10650e678>

0

Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

2

3 <span>Elsie</span>

4 Elsie

5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

8 Lacie

9

and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

11 Tillie

12

and they lived at the bottom of a well.
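As the output shows, whitespace and text between tags count as nodes too. A small sketch on a made-up one-line paragraph that compares the two iterators and keeps only the Tag children:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one <a><span>Elsie</span></a> two</p>", "html.parser")
p = soup.p

children = list(p.children)          # direct children only
descendants = list(p.descendants)    # children, grandchildren, and so on
tag_children = [c for c in p.children if isinstance(c, Tag)]  # drop the text nodes

print(len(children), len(descendants))
print([t.name for t in tag_children])
```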

7. Parent and ancestors

7.1: .parent returns the parent node

html = """

<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

    <body>

        <p class="story">

            Once upon a time there were three little sisters; and their names were

            <a href="http://example.com/elsie" class="sister" id="link1">

                <span>Elsie</span>

            </a>

            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

            and

            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

            and they lived at the bottom of a well.

        </p>

        <p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent)

Output:

<p class="story">

            Once upon a time there were three little sisters; and their names were

            <a class="sister" href="http://example.com/elsie" id="link1">

<span>Elsie</span>

</a>

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

            and

            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

            and they lived at the bottom of a well.

        </p>

7.2: .parents iterates over all ancestor nodes

html = """

<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

    <body>

        <p class="story">

            Once upon a time there were three little sisters; and their names were

            <a href="http://example.com/elsie" class="sister" id="link1">

                <span>Elsie</span>

            </a>

            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

            and

            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

            and they lived at the bottom of a well.

        </p>

        <p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(list(enumerate(soup.a.parents)))

Output:

 [(0, <p class="story">

             Once upon a time there were three little sisters; and their names were

             <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

             and

             <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

             and they lived at the bottom of a well.

         </p>), (1, <body>

 <p class="story">

             Once upon a time there were three little sisters; and their names were

             <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

             and

             <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

             and they lived at the bottom of a well.

         </p>

 <p class="story">...</p>

 </body>), (2, <html>

 <head>

 <title>The Dormouse's story</title>

 </head>

 <body>

 <p class="story">

             Once upon a time there were three little sisters; and their names were

             <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

             and

             <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

             and they lived at the bottom of a well.

         </p>

 <p class="story">...</p>

 </body></html>), (3, <html>

 <head>

 <title>The Dormouse's story</title>

 </head>

 <body>

 <p class="story">

             Once upon a time there were three little sisters; and their names were

             <a class="sister" href="http://example.com/elsie" id="link1">

 <span>Elsie</span>

 </a>

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

             and

             <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

             and they lived at the bottom of a well.

         </p>

 <p class="story">...</p>

 </body></html>)]
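Printing whole ancestors is noisy because each one contains the previous one. A shorter way to see the chain is to print only the tag names. The sketch below uses a tiny made-up document; the last entry is the BeautifulSoup object itself, whose name is '[document]', which is also why the html element appears twice in the output above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)

parent_name = soup.a.parent.name
ancestor_names = [parent.name for parent in soup.a.parents]

print(parent_name)
print(ancestor_names)
```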

8. Siblings

next_siblings yields the sibling nodes after the current one; previous_siblings yields the ones before it.

html = """

<html>

    <head>

        <title>The Dormouse's story</title>

    </head>

    <body>

        <p class="story">

            Once upon a time there were three little sisters; and their names were

            <a href="http://example.com/elsie" class="sister" id="link1">

                <span>Elsie</span>

            </a>

            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>

            and

            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

            and they lived at the bottom of a well.

        </p>

        <p class="story">...</p>

"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(list(enumerate(soup.a.next_siblings)))

print(list(enumerate(soup.a.previous_siblings)))

Output:

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]

[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
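Because text nodes are siblings as well, the nearest sibling is often just whitespace. A minimal sketch, on made-up markup, of the difference between .next_sibling and find_next_sibling(), which skips to the next matching tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>\n<a id="link2">Lacie</a></p>', "html.parser"
)
first = soup.a

raw_next = first.next_sibling                # the "\n" text node between the tags
next_tag = first.find_next_sibling("a")      # skips the text and returns the next <a>
no_prev = first.find_previous_sibling("a")   # nothing comes before the first <a>

print(repr(raw_next), next_tag["id"], no_prev)
```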

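IV. Standard selectors

The standard selectors are the find_all() and find() methods. Both search by tag name, attributes, or text content, and both take the same arguments. The difference is that find() returns the first matching element, while find_all() returns all of them.

find_all(name, attrs, recursive, text, **kwargs)

1. Searching by tag name

1.1: Find all ul tags by tag name. The result is a list, and every item in it is a Tag.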

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all('ul'))

print(type(soup.find_all('ul')[0]))

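Output:

[<ul class="list" id="list-1">

<li class="element">Foo</li>

<li class="element">Bar</li>

<li class="element">Jay</li>

</ul>, <ul class="list list-small" id="list-2">

<li class="element">Foo</li>

<li class="element">Bar</li>

</ul>]

<class 'bs4.element.Tag'>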

1.2: Find all ul tags, then print the li tags under each one

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

for ul in soup.find_all('ul'):

    print(ul.find_all('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

[<li class="element">Foo</li>, <li class="element">Bar</li>]

2. attrs

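Method 1: pass an attrs dictionary, here matching on the id and name attributes. The HTML name attribute has to go through attrs, because the name argument of find_all() already means the tag name.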

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1" name="elements">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(attrs={'id': 'list-1'}))

print(soup.find_all(attrs={'name': 'elements'}))

Output (both calls return the same thing, the first ul tag):

[<ul class="list" id="list-1" name="elements">

<li class="element">Foo</li>

<li class="element">Bar</li>

<li class="element">Jay</li>

</ul>]

[<ul class="list" id="list-1" name="elements">

<li class="element">Foo</li>

<li class="element">Bar</li>

<li class="element">Jay</li>

</ul>]

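Method 2: search with keyword arguments. Because class is a reserved word in Python, the keyword argument for the class attribute is class_ (with a trailing underscore).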

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(id='list-1'))

print(soup.find_all(class_='element'))

Output:

[<ul class="list" id="list-1">

<li class="element">Foo</li>

<li class="element">Bar</li>

<li class="element">Jay</li>

</ul>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
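The second ul in the example carries two classes, which raises the question of how class_ treats multi-valued attributes. A minimal sketch, with the list items stripped out of the markup for brevity: a single class name matches any tag that has it, and the full attribute string matches exactly.

```python
from bs4 import BeautifulSoup

doc = """
<ul class="list" id="list-1"></ul>
<ul class="list list-small" id="list-2"></ul>
"""
soup = BeautifulSoup(doc, "html.parser")

# Matches every <ul> that has "list" among its classes.
by_one_class = [ul["id"] for ul in soup.find_all("ul", class_="list")]
# Matches only the exact attribute value "list list-small".
by_exact_value = [ul["id"] for ul in soup.find_all("ul", class_="list list-small")]
# The attrs dictionary follows the same rules as class_.
by_attrs_dict = [ul["id"] for ul in soup.find_all("ul", attrs={"class": "list-small"})]

print(by_one_class, by_exact_value, by_attrs_dict)
```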

3. text: selecting by text content (Beautiful Soup 4.4.0 and later call this argument string; text is the older name)

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(text='Foo'))

Output:

['Foo', 'Foo']
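Note that the result above is a list of text nodes, not tags. The sketch below uses the newer string argument (assuming Beautiful Soup 4.4.0 or later) on a made-up list. It shows a regular-expression match and how to combine a tag name with string to get the tags back instead.

```python
import re

from bs4 import BeautifulSoup

doc = "<ul><li>Foo</li><li>Bar</li><li>Baz</li><li>Foo</li></ul>"
soup = BeautifulSoup(doc, "html.parser")

exact = soup.find_all(string="Foo")                  # matching text nodes
pattern = soup.find_all(string=re.compile(r"^Ba"))   # regex applied to text nodes
tags = soup.find_all("li", string="Foo", limit=1)    # <li> tags whose .string is "Foo"

print(exact, pattern, tags)
```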

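find(name, attrs, recursive, text, **kwargs)

find() takes the same arguments as find_all() but returns only the first match, or None if nothing matches.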

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.find('ul'))

print(type(soup.find('ul')))

print(soup.find('page'))

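Output:

<ul class="list" id="list-1">

<li class="element">Foo</li>

<li class="element">Bar</li>

<li class="element">Jay</li>

</ul>

<class 'bs4.element.Tag'>

None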
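The return types are the main practical difference between the two methods. A short sketch on a made-up list, showing what each returns on a hit and on a miss:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

one = soup.find("li")           # a single Tag
many = soup.find_all("li")      # a list-like ResultSet of Tags
none = soup.find("page")        # None when nothing matches
empty = soup.find_all("page")   # an empty list when nothing matches

print(type(one).__name__, len(many), none, len(empty))
```

Code that calls find() should therefore check for None before touching the result, while code that calls find_all() can loop over the result unconditionally.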

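1. find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the immediate parent.

2. find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

3. find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

4. find_all_next() and find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first one.

5. find_all_previous() and find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one.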
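These methods accept the same filters as find() and find_all(). The sketch below starts from the middle li of a made-up list (the ids box and mid are illustrative) and calls several of them. Note that find_next() is not limited to siblings, and that find_previous() also sees ancestors, because their opening tags come earlier in the document.

```python
from bs4 import BeautifulSoup

doc = (
    '<div id="box"><ul id="list-1">'
    '<li>Foo</li><li id="mid">Bar</li><li>Jay</li>'
    "</ul><p>after</p></div>"
)
soup = BeautifulSoup(doc, "html.parser")
mid = soup.find(id="mid")

parent_id = mid.find_parent("ul")["id"]                        # nearest <ul> ancestor
ancestors = [t.name for t in mid.find_parents(["ul", "div"])]  # nearest first
next_li = mid.find_next_sibling("li").get_text()
prev_li = mid.find_previous_sibling("li").get_text()
sibling_p = mid.find_next_sibling("p")           # None: <p> is not a sibling of <li>
later_p = mid.find_next("p").get_text()          # found: it comes later in the document
earlier_ul = mid.find_previous("ul")["id"]       # the enclosing <ul> opened earlier

print(parent_id, ancestors, next_li, prev_li, sibling_p, later_p, earlier_ul)
```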

V. CSS selectors

Pass a CSS selector straight to select() to make a selection.

1.1: Passing CSS selectors to select()

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.select('.panel .panel-heading'))  # select by class

print(soup.select('ul li'))  # select by tag name

print(soup.select('#list-2 .element'))  # select by ID

print(type(soup.select('ul')[0]))

Output:

1. [<div class="panel-heading">

<h4>Hello</h4>

</div>]

2. [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

3. [<li class="element">Foo</li>, <li class="element">Bar</li>]

4. <class 'bs4.element.Tag'>
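select() always returns a list. A few related calls are sketched below on a shortened, made-up version of the list markup: select_one() for the first match, the child combinator >, and an attribute-prefix selector.

```python
from bs4 import BeautifulSoup

doc = """
<div class="panel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

first_ul = soup.select_one("ul")                                    # first match, or None
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]
by_attribute = [ul["id"] for ul in soup.select('ul[id^="list-"]')]  # id starts with "list-"
nothing = soup.select_one("table")

print(first_ul["id"], small_items, by_attribute, nothing)
```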

1.2: Iterating over the results

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

for ul in soup.select('ul'):

    print(ul.select('li'))

Output:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

[<li class="element">Foo</li>, <li class="element">Bar</li>]

2. Getting attributes

Print the id attribute of every ul tag, in two ways

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

for ul in soup.select('ul'):

    print(ul['id'])  # method 1

    print(ul.attrs['id'])  # method 2

Output:

list-1

list-1

list-2

list-2

3. Getting the content

Print the text content of every li tag

html='''

<div class="panel">

    <div class="panel-heading">

        <h4>Hello</h4>

    </div>

    <div class="panel-body">

        <ul class="list" id="list-1">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

            <li class="element">Jay</li>

        </ul>

        <ul class="list list-small" id="list-2">

            <li class="element">Foo</li>

            <li class="element">Bar</li>

        </ul>

    </div>

</div>

'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

for li in soup.select('li'):

    print(li.get_text())

Output:

Foo

Bar

Jay

Foo

Bar
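get_text() on a container tag also returns the whitespace between its children. A small sketch of the separator and strip arguments, and of the related stripped_strings generator, on one of the lists from the example:

```python
from bs4 import BeautifulSoup

doc = """
<ul id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(doc, "html.parser")
ul = soup.ul

raw = ul.get_text()                     # keeps the newlines and indentation between tags
joined = ul.get_text(" ", strip=True)   # strip each piece, then join with a space
pieces = list(ul.stripped_strings)      # the same pieces as a list

print(repr(raw))
print(joined)
print(pieces)
```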

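Summary

Prefer the lxml parser; fall back to html.parser when necessary.
Tag selectors offer weak filtering but are fast.
Use find() to match a single result and find_all() to match several: find() returns one Tag (or None), find_all() returns a list.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attribute values and text.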