Beautiful Soup: A Handy Python Tool for Web Scraping

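Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Beautiful Soup can save you hours or even days of work.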

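The installation process is not covered here. With the Anaconda management tool it can be installed in a single step, and the packages it depends on are installed automatically.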

Using Beautiful Soup

# First, import it from bs4

from bs4 import BeautifulSoup

A simple usage example

from bs4 import BeautifulSoup

html = '''

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

'''

Parsing this markup with BeautifulSoup produces a BeautifulSoup object, which can print the document as a nested structure with standard indentation:

soup = BeautifulSoup(html,'lxml')

print('***'*10)

print(soup.prettify())

Output:

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>

Output of other attributes

print(soup.title)

# <title>The Dormouse's story</title>

print(soup.title.name)

# title

print(soup.title.string)

# The Dormouse's story

print(soup.title.parent.name)

# head

print(soup.p)

# <p class="title"><b>The Dormouse's story</b></p>

print(soup.p["class"])

# ['title']

print(soup.a)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find_all('a'))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.find(id='link3'))

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Find the links of all <a> tags in the document:

# for link in soup.findAll('a'):

#     print(link.get('href'))

for link in soup.find_all('a'):

    print(link.get('href'))

# http://example.com/elsie

# http://example.com/lacie

# http://example.com/tillie

# Get all of the text content in the document:

# print(soup.getText())

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

# ...
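The link loop above can be folded into a small reusable helper. The following is only a sketch under stated assumptions: the function name extract_links and its base_url parameter are illustrative and not part of Beautiful Soup, and the built-in html.parser is used so the example runs without lxml installed.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html_text, base_url=""):
    """Return (link text, absolute URL) pairs for every <a> tag with an href.

    extract_links and base_url are illustrative names, not Beautiful Soup APIs.
    """
    link_soup = BeautifulSoup(html_text, "html.parser")
    pairs = []
    for anchor in link_soup.find_all("a"):
        href = anchor.get("href")  # None when the attribute is missing
        if href is None:
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        pairs.append((anchor.get_text(strip=True), urljoin(base_url, href)))
    return pairs


sample = (
    '<p><a href="/elsie">Elsie</a> '
    '<a name="no-href">skip</a> '
    '<a href="http://example.com/lacie">Lacie</a></p>'
)
links = extract_links(sample, base_url="http://example.com/")
print(links)
```

Using get() instead of indexing means anchors without an href are skipped rather than raising KeyError.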

Beautiful Soup parsers

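Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml.

The table below lists the main parsers along with their advantages and disadvantages: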

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

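Parser	Typical usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Very fast; the only supported XML parser	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a web browser does; produces valid HTML5	Very slow; requires an external Python package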
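Since the third-party parsers may not be installed everywhere, one option is to try them in order of preference. This is a minimal sketch, assuming a hypothetical helper named parse_with_fallback (not a Beautiful Soup API); it relies on Beautiful Soup raising FeatureNotFound when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_fallback(markup, parsers=("lxml", "html.parser")):
    """Try each parser name in order; return the first soup that can be built.

    parse_with_fallback is a hypothetical helper, not a Beautiful Soup API.
    """
    for parser_name in parsers:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is installed")


fallback_soup, used_parser = parse_with_fallback("<p>Hello <b>parser</b></p>")
print(used_parser)             # whichever parser was available first
print(fallback_soup.b.string)
```

Keep in mind that different parsers can build different trees from invalid markup, so falling back silently is best reserved for reasonably well-formed input.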

Kinds of objects

Beautiful Soup transforms a complex HTML document into a tree of Python objects. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.
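To see the four kinds side by side, the short sketch below parses a tiny made-up snippet and prints the class name of each node. The markup and variable names are assumptions chosen for illustration, and html.parser is used so that no third-party parser is needed.

```python
from bs4 import BeautifulSoup

kinds_soup = BeautifulSoup("<p>Hello<!-- note --><b>world</b></p>", "html.parser")

# The soup itself is the BeautifulSoup object
print(type(kinds_soup).__name__)

# .descendants walks every node below the soup in document order
node_kinds = [type(node).__name__ for node in kinds_soup.descendants]
print(node_kinds)
```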

【Tag】

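A Tag object corresponds to a tag node, the same as a tag in the original XML or HTML document, such as body, div, a, or span. A Tag object has many methods and attributes, and its HTML attributes can be added, removed, modified, and read like the entries of a dictionary.

The name attribute

Every tag has a name, available as .name. If you change a tag's name, the change is reflected in any HTML markup generated from the current Beautiful Soup object.

# Get the first tag named a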

tag = soup.a

print(tag)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(tag.name)

# a

print(soup.name)

# [document]
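The effect of renaming a tag can be checked directly. A minimal sketch, assuming a one-tag document and the built-in html.parser (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

rename_soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
bold_tag = rename_soup.b

bold_tag.name = "blockquote"  # rename the tag in place

print(bold_tag)       # the tag object now serializes under its new name
print(rename_soup)    # and so does the whole document
print(rename_soup.b)  # no <b> tag is left to find
```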

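Attributes

A tag can have any number of attributes. tag.attrs returns all of a tag's attributes, which can be added, removed, modified, and read. They can be accessed as follows:

tag.attrs: get all attributes as a dictionary
tag.attrs['id']: get the id attribute from that dictionary (attrs is a dict, so it is indexed by attribute name, not by position)
tag.get('href'): get the href attribute, or None if it is missing
tag['href']: get the href attribute, raising KeyError if it is missing

# Get the first tag named a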

tag = soup.a

print(tag)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(tag.name)

# a

print(tag.attrs)

# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

print(tag.get('href'))

# http://example.com/elsie

print(tag.get('class'))

# ['sister']

print(tag['class'])

# ['sister']

print(tag['id'])

# link1
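When an attribute may be absent, checking for it or supplying a default avoids a KeyError. The sketch below assumes the same kind of anchor tag as above, parsed with html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = attr_soup.a

# attrs is a plain dict keyed by attribute name
print(sorted(link.attrs))

# has_attr() tests for presence without raising
print(link.has_attr("href"), link.has_attr("title"))

# get() accepts a default, just like dict.get()
print(link.get("title", "(no title)"))
```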

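Multi-valued attributes

Some HTML attributes, class being the typical example, can hold several values. For these multi-valued attributes Beautiful Soup returns a list rather than a string. The multi-valued attributes are: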

class
rel
rev
accept-charset
headers
accesskey

Documents parsed as XML have no multi-valued attributes.

print(tag['class'])

# ['sister']
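The difference between a multi-valued attribute and an ordinary one shows up when a value contains a space. A short sketch with made-up markup, assuming the built-in html.parser:

```python
from bs4 import BeautifulSoup

multi_soup = BeautifulSoup(
    '<p class="body strikeout" id="my id">text</p>', "html.parser"
)
para = multi_soup.p

print(para["class"])  # class is split on whitespace into a list
print(para["id"])     # id is not multi-valued, so it stays one string

# Join the list when a single string is needed
class_string = " ".join(para["class"])
print(class_string)
```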

A tag's attributes can be added, removed, or modified. Again, this works exactly as it does for a dictionary.

tag['class'] = 'verybold'

tag['id'] = 1

print(tag)

# <a class="verybold" href="http://example.com/elsie" id="1">Elsie</a>

del tag['class']

del tag['id']

print(tag)

# <a href="http://example.com/elsie">Elsie</a>

# tag['class'] would now raise KeyError: 'class'

print(tag.get('class'))

# None

【NavigableString】

Strings are usually contained inside tags. Beautiful Soup uses the NavigableString class to wrap the string inside a tag:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

soup = BeautifulSoup('<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>', 'lxml')

tag = soup.a

print(tag)

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(tag.string)

# Elsie

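This makes it easy to get the content inside a tag; imagine how much more work it would take with regular expressions. The value is a NavigableString, literally a string that can be navigated, though it is best to refer to it by its class name.

The string inside a tag cannot be edited in place, but it can be replaced with another string using replace_with():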

tag.string.replace_with('Jack')

print(tag)

# <a class="sister" href="http://example.com/elsie" id="link1">Jack</a>

print(tag.string)

# Jack
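Two details about .string are easy to miss: a NavigableString behaves like an ordinary Python string, and .string is None when a tag has more than one child. A small sketch with made-up markup, assuming the built-in html.parser:

```python
from bs4 import BeautifulSoup

string_soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

inner = string_soup.b.string
print(type(inner).__name__)
print(isinstance(inner, str))  # NavigableString is a str subclass

plain = str(inner)  # plain str copy that no longer refers to the tree

# <p> has two children (a string and a tag), so .string is None
print(string_soup.p.string)

# .strings and get_text() collect the text of all descendants instead
print(list(string_soup.p.strings))
print(string_soup.p.get_text())
```

Converting to a plain str is useful when the text is kept long after parsing, since a NavigableString carries a reference to the whole parse tree.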

【BeautifulSoup】

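The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods described under navigating the tree and searching the tree.

Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no real name or attributes. It is sometimes useful to look at its .name, though, so it has been given the special .name value "[document]".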

soup = BeautifulSoup(html,'lxml')

# print(soup.prettify())

print(soup.name)

# [document]

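【Comments and other special strings: Comment】

Tag, NavigableString, and BeautifulSoup cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to run into is the comment: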

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"

soup = BeautifulSoup(markup, 'lxml')

comment = soup.b.string

print(comment)

# Hey, buddy. Want to buy a used parser?

print(type(comment))

# <class 'bs4.element.Comment'>

A Comment object is a special type of NavigableString.
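Because a Comment is a NavigableString subclass, comments can be picked out with an isinstance check. The sketch below collects every comment in a made-up snippet and then removes them from the tree; the markup and variable names are assumptions for illustration, and html.parser is used so it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment

comment_markup = (
    "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    "<p>text<!-- second --></p>"
)
comment_soup = BeautifulSoup(comment_markup, "html.parser")

# find_all returns a list, so removing nodes while looping over it is safe
found = comment_soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in found]
print(comment_texts)

for c in found:
    c.extract()  # detach the comment from the tree

print(comment_soup)
```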

That concludes this brief introduction. For more detail, see https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#