Basic Crawler Libraries: BeautifulSoup

I. Basic usage of BeautifulSoup

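Simply put, Beautiful Soup is a Python library whose main job is to extract data from web pages. The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.

It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.

For more, see the official documentation.

1. Installation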

pip3 install beautifulsoup4

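(1) Parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, it falls back to Python's built-in parser (html.parser). The lxml parser is more capable and faster, so installing it is recommended: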

pip3 install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a browser does. Install it with:

pip install html5lib

(2) Parser comparison
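One way to see how the parsers differ is to feed each of them the same malformed fragment and compare what comes back. The sketch below is not from the original article: the fragment "<a></p>" and the helper name parse_with are illustrative assumptions, and any parser that is not installed is simply reported as missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # deliberately invalid: </p> has no opening tag


def parse_with(parser_name, markup=BROKEN):
    """Return the repaired markup as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(markup, parser_name))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, repaired in results.items():
    print(f"{name:12} -> {repaired if repaired is not None else 'not installed'}")
```

Each parser repairs invalid markup in its own way, so it is worth fixing the parser name explicitly in every BeautifulSoup(...) call rather than relying on whatever happens to be installed.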

2. Quick start

The following HTML is used as the running example. It is an excerpt from Alice in Wonderland (referred to below as the Alice document):

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

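Parsing this document with BeautifulSoup gives a BeautifulSoup object, which can print the document as a conventionally indented structure:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # a <class 'bs4.BeautifulSoup'> object, built with the html.parser parser

print(soup.prettify())  # print with standard indentation

Output: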

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters; and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

     Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

II. Navigating the tree

Here are a few simple ways to navigate the parsed document:

The simplest way to navigate the tree is to name the tag you want.

(1) To get the <head> tag, just write soup.head:

soup.head

# <head><title>The Dormouse's story</title></head>

soup.title

# <title>The Dormouse's story</title>

Tag names can be chained:

soup.body.b

# <b>The Dormouse's story</b>

Note: dot access only returns the first tag with that name.

soup.a  # there are three <a> tags in total

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

To get all matching tags, use find_all():

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
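The difference between dot access and find_all() can be checked directly. The following is a minimal sketch on a shortened fragment (the markup here is an assumption made to keep the example self-contained); it shows that soup.a and soup.find("a") return the very same element, while find_all("a") returns every match.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_by_dot = soup.a            # first <a> only
first_by_find = soup.find("a")   # the same element, found explicitly
all_links = soup.find_all("a")   # every <a>, in document order

print(first_by_dot is first_by_find)
print([a["id"] for a in all_links])
```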

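(2) .contents, .children and .descendants (child nodes)

.contents: returns all of a tag's direct children as a list, so every list operation is available.

head_tag = soup.head

head_tag.contents

# [<title>The Dormouse's story</title>]

soup.contents[0].name  # indexing works; here the <html> tag is the only top-level node

# 'html'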

.children: returns an iterator over a tag's direct children, which you can loop over.

title_tag = soup.title

for child in title_tag.children:

    print(child)

    # The Dormouse's story

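.descendants: returns a generator over all of a tag's descendants (children, grandchildren, and so on).

print(soup.head.contents)  # only one direct child

# [<title>The Dormouse's story</title>]

for i in soup.head.descendants:  # one child tag, plus the string inside it

    print(i)

# <title>The Dormouse's story</title>

# The Dormouse's story

# Note: a string also counts as a node in its own right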
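To make the difference between the three attributes concrete, this small sketch counts what each one yields for the <head> tag. The one-line fragment is an assumption used to keep the example self-contained.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

direct = head.contents                 # a list holding the <title> tag
children = list(head.children)         # the same nodes, produced lazily
descendants = list(head.descendants)   # the <title> tag, then the string inside it

print(len(direct), len(children), len(descendants))
```

.contents and .children stop at the first level, while .descendants keeps going down, which is why the string inside <title> only shows up in the last list.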

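(3) .string and .stripped_strings

.string returns a tag's text when the tag has exactly one child (a string, or a single tag that itself has a .string). If the tag has several children, .string is None.

print(soup.title.string)

# The Dormouse's story

print(soup.head.string)   # works through a single nested tag as well

# The Dormouse's story

print(soup.body.string)  # several children, so there is no single string to return

# None

for i in soup.body:   # with several children, loop over them instead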

    print(i)

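.stripped_strings yields the strings in the document with surrounding whitespace stripped, skipping strings that are only whitespace: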

for string in soup.stripped_strings:

    print(repr(string))

# "The Dormouse's story"

# "The Dormouse's story"

# 'Once upon a time there were three little sisters; and their names were'

# 'Elsie'

# ','

# 'Lacie'

# 'and'

# 'Tillie'

# ';\nand they lived at the bottom of a well.'

# '...'
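The behaviour of .string and .stripped_strings can be seen side by side on a tiny fragment. This is a sketch with made-up markup (an assumption, not the Alice document), chosen so that one tag has a single child and the other has several.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Hello <b> world </b>\n</p>", "html.parser")
p = soup.p

single = p.b.string                 # <b> has exactly one child, so this is its string
multiple = p.string                 # <p> has several children, so this is None
pieces = list(p.stripped_strings)   # whitespace trimmed, empty strings skipped
joined = " ".join(pieces)

print(repr(single), multiple, pieces, joined)
```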

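(4) .parent and .parents (parent nodes)

Use the .parent attribute to get an element's parent:

print(soup.title.parent)

# <head><title>The Dormouse's story</title></head>

The .parents attribute iterates over all of an element's ancestors:

for i in soup.a.parents:  # walks from the innermost ancestor outward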

    print(i.name)

# p

# body

# html

# [document]
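A common use of .parents is to build the path from the document root down to an element. The sketch below does this for one link; the shortened markup and the " > " separator are illustrative assumptions.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", id="link1")
ancestors = [parent.name for parent in link.parents]   # innermost first
path = " > ".join(list(reversed(ancestors)) + [link.name])

print(ancestors)
print(path)
```

The last ancestor is the BeautifulSoup object itself, whose name is "[document]".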

(5) Sibling nodes

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "lxml")  # lxml wraps the fragment in <html><body>

print(sibling_soup.prettify())

# <html>

#  <body>

#   <a>

#    <b>

#     text1

#    </b>

#    <c>

#     text2

#    </c>

#   </a>

#  </body>

# </html>

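.next_sibling moves to the next sibling in the document.

.previous_sibling moves to the previous sibling.

To check whether two nodes are siblings, just check whether they have the same parent.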

sibling_soup.b.next_sibling

# <c>text2</c>

sibling_soup.c.previous_sibling

# <b>text1</b>

.next_siblings and .previous_siblings iterate over a node's following and preceding siblings:

for i in enumerate(soup.a.next_siblings,1):  # search forward

    print(i)

# (1, ',\n')

# (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)

# (3, ' and\n')

# (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)

# (5, ';\nand they lived at the bottom of a well.')

for i in enumerate(soup.a.previous_siblings,1):  # search backward

    print(i)

# (1, 'Once upon a time there were three little sisters; and their names were\n')
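As the output above shows, the sibling iterators return text nodes as well as tags. When only the tag siblings are wanted, one option is to filter by type, as in this sketch (the one-line markup is an assumption made for brevity).

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a
all_following = list(first.next_siblings)   # strings and tags mixed together
tag_siblings = [s for s in all_following if isinstance(s, Tag)]
same_parent = first.parent is tag_siblings[0].parent   # the sibling test described above

print(len(all_following), [t["id"] for t in tag_siblings], same_parent)
```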

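(6) Going back and forth

First, it helps to understand how parsing proceeds. Take this fragment:

<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

The HTML parser turns this string into a series of events: "open an <html> tag", "open a <head> tag", "open a <title> tag", "add a string", "close the <title> tag", "open a <p> tag", and so on.

.next_element points to whatever was parsed immediately after the current object. For an <a> tag that is the string inside the tag (for example "Lacie" for the link2 tag), not the rest of the sentence that follows the closing </a>.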

print(soup.find("a",id="link2").next_element)

# Lacie

.previous_element is the opposite: it points to whatever was parsed immediately before the current object.

print(soup.find("a",id="link2").previous_element)

# ,

The .next_elements and .previous_elements iterators let you move forward or backward through the document in the order it was parsed:

for element in soup.find("a",id="link3").next_elements:

    print(repr(element))

# 'Tillie'

# ';\nand they lived at the bottom of a well.'

# '\n'

# <p class="story">...</p>

# '...'

# '\n'
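The distinction between .next_sibling and .next_element is easy to mix up, so here is a minimal sketch on a two-node fragment (the markup is an assumption chosen to make the difference obvious).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a>two</p>", "html.parser")
a = soup.a

sibling = a.next_sibling            # next node at the same level: 'two'
element = a.next_element            # next node in parse order: 'one', inside <a>
after_text = a.string.next_element  # leaving <a>, parse order reaches 'two'
parse_order = [str(e) for e in soup.p.next_elements]

print(repr(sibling), repr(element), repr(after_text))
print(parse_order)
```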

III. Searching the tree

1. find_all()

find_all( name , attrs , recursive , string , limit , **kwargs )

find_all() looks through all of a tag's descendants and returns every one that matches the filters:

soup.find_all("title")

# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")

# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re

soup.find(string=re.compile("sisters"))

# 'Once upon a time there were three little sisters; and their names were\n'

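Some of these calls look familiar and some are new. What do the string and id arguments mean?

Why does find_all("p", "title") return the <p> tag whose CSS class is "title"? Let's look at the arguments of find_all() in detail.

(1) The name argument

The name argument finds all tags whose name is name; string nodes are skipped automatically.

The simplest usage is:

soup.find_all("title")

# [<title>The Dormouse's story</title>]

The value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.

<1> A string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tag names against that exact string. The following example finds all the <b> tags in the document: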

soup.find_all('b')

# [<b>The Dormouse's story</b>]

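<2> A regular expression

If you pass a compiled regular expression, Beautiful Soup tests tag names with the pattern's search() method. The example below finds every tag whose name starts with "b", so both <body> and <b> are found: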

import re

for tag in soup.find_all(re.compile("^b")):

    print(tag.name)

# body

# b
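Because Beautiful Soup applies a compiled pattern with search() rather than match(), the "^" anchor matters. The sketch below contrasts an anchored and an unanchored pattern on a small document (the markup is an assumption chosen to keep the tag list short).

```python
import re

from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains a "t" anywhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)
print(unanchored)
```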

<3> A list

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all the <a> tags and all the <b> tags:

soup.find_all(["a", "b"])

# [<b>The Dormouse's story</b>,

#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<4> True

True matches every tag. The following code finds all the tags in the document but none of the string nodes:

for tag in soup.find_all(True):

    print(tag.name)

'''

html

head

title

body

p

b

p

a

a

a

p

'''

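<5> A function

If none of the other filters fits, define a function that takes an element as its only argument. It should return True if the element matches and False otherwise.

The following function returns True if a tag has a class attribute but no id attribute: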

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all() and you get all the <p> tags:

print(soup.find_all(has_class_but_no_id))

'''

[

<p class="title"><b>The Dormouse's story</b></p>,

<p class="story">Once upon a time there were three little sisters; and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

    and they lived at the bottom of a well.

</p>,

<p class="story">...</p>

]

'''

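(2) Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name.

For example, if you pass a value for id, Beautiful Soup filters on each tag's id attribute: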

soup.find_all(id='link2')

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re

print(soup.find_all(href=re.compile("elsie")))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

An attribute filter can be a string, a regular expression, a list, or True.

The following example finds every tag that has an id attribute, whatever its value:

soup.find_all(id=True)

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Pass several keyword arguments to filter on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

To filter on class, which is a reserved word in Python, add a trailing underscore and use class_:

print(soup.find_all("a", class_="sister"))

'''

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

]

'''

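Attributes whose names are not valid Python keyword arguments, such as HTML5 data-* attributes, can be searched by passing a dictionary as the attrs argument:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

data_soup.find_all(attrs={"data-foo": "value"})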

# [<div data-foo="value">foo!</div>]
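The attrs dictionary is also the way to search on an HTML attribute called name, because name is already the first argument of find_all(). The sketch below shows both cases; the form markup is an assumption used for illustration.

```python
from bs4 import BeautifulSoup

html = '<form><input name="email"/><div data-foo="value">foo!</div></form>'
soup = BeautifulSoup(html, "html.parser")

# "data-foo" contains a hyphen, so it cannot be written as a keyword argument.
by_data = soup.find_all(attrs={"data-foo": "value"})

# name="email" is read as a tag name, so this looks for <email> tags.
by_name_kwarg = soup.find_all(name="email")

# Passing the attribute through attrs searches the name attribute instead.
by_name_attr = soup.find_all(attrs={"name": "email"})

print(by_data, by_name_kwarg, by_name_attr)
```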

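(3) The text argument

The text argument searches the strings in the document rather than the tags. Like name, it accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 this argument is called string; text is the older name for the same argument.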

import re

print(soup.find_all(text="Elsie"))

# ['Elsie']

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))

# ['Elsie', 'Lacie', 'Tillie']

print(soup.find_all(text=re.compile("Dormouse")))

# ["The Dormouse's story", "The Dormouse's story"]

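(4) The limit argument

find_all() returns every match, which can be slow on a large document. If you do not need all the results, pass limit to cap how many are returned.

It works like the LIMIT keyword in SQL: once limit results have been found, the search stops and returns them.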

print(soup.find_all("a",limit=2))

'''

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

'''

(5) The recursive argument

By default, find_all() on a tag examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.

print(soup.html.find_all("title"))  # [<title>The Dormouse's story</title>]

print(soup.html.find_all("title",recursive=False))  # []

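2. find()

find( name , attrs , recursive , string , **kwargs )

find_all() returns every matching tag in the document, but sometimes you only want one result.

If the document has only one <body> tag, for instance, scanning the whole document with find_all() is wasteful.

Rather than calling find_all() with limit=1, use find() directly. The following two calls are nearly equivalent: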

soup.find_all('title', limit=1)

# [<title>The Dormouse's story</title>]

soup.find('title')

# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing the single result, while find() returns the result itself.

When nothing matches, find_all() returns an empty list and find() returns None.

print(soup.find("nosuchtag"))

# None
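Since find() returns None when nothing matches, calling a method on the result without checking raises AttributeError. A minimal sketch of a guard follows; the helper name text_of and its default parameter are hypothetical, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)


def text_of(tree, tag_name, default=""):
    """Return the text of the first <tag_name> in tree, or default if there is none."""
    tag = tree.find(tag_name)
    return tag.get_text() if tag is not None else default


title = text_of(soup, "title")
missing = text_of(soup, "nosuchtag", default="(none)")

print(title)
print(missing)
```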

soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly:

soup.head.title

# <title>The Dormouse's story</title>

soup.find("head").find("title")

# <title>The Dormouse's story</title>

3. find_parents() and find_parent()

a_string = soup.find(string="Lacie")

print(a_string)  # Lacie

print(a_string.find_parent())

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print(a_string.find_parents())

print(a_string.find_parent("p"))

'''

<p class="story">

    Once upon a time there were three little sisters; and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and

    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

    and they lived at the bottom of a well.

</p>

'''
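Both methods accept the same filters as find_all(), which makes it possible to jump straight to a particular ancestor. The following sketch, on a shortened fragment (an assumption), shows a filtered search and what happens when no ancestor matches.

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story">Once <a id="link2">Lacie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

a_string = soup.find(string="Lacie")
nearest = a_string.find_parent()        # the enclosing <a> tag
paragraph = a_string.find_parent("p")   # the nearest <p> ancestor
named = [t.name for t in a_string.find_parents(["p", "body"])]  # innermost first
nothing = a_string.find_parent("div")   # no <div> ancestor, so None

print(nearest.name, paragraph["class"], named, nothing)
```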

4. find_next_siblings() and find_next_sibling()

find_next_sibling( name , attrs , string , **kwargs )

Both methods use .next_siblings to iterate over the siblings that come after the current tag in the document.

find_next_siblings() returns all the following siblings that match, and find_next_sibling() returns only the first one.

first_link = soup.a

print(first_link.find_next_sibling("a"))

# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

print(first_link.find_next_siblings("a"))

'''

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

]

'''

find_previous_siblings() and find_previous_sibling() work the same way as find_next_siblings() and find_next_sibling(), but search the siblings that come before the current tag.
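For completeness, here is a short sketch of the backward-looking pair, starting from the last link. The one-line markup is an assumption; note that the results come back nearest sibling first.

```python
from bs4 import BeautifulSoup

html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

last_link = soup.find("a", id="link3")
previous_one = last_link.find_previous_sibling("a")    # closest earlier <a>
previous_all = last_link.find_previous_siblings("a")   # all earlier <a> tags

print(previous_one["id"])
print([a["id"] for a in previous_all])
```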

5. find_all_next() and find_next()

find_all_next( name , attrs , string , limit , **kwargs )

find_next( name , attrs , string , **kwargs )

Both methods use .next_elements to iterate over the tags and strings that come after the current tag in the document.

find_all_next() returns all the matches and find_next() returns only the first:

first_link = soup.a

print(first_link.find_all_next(string=True))

# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', ';\nand they lived at the bottom of a well.', '\n', '...', '\n']

print(first_link.find_next(string=True)) # Elsie

find_all_previous() and find_previous() work the same way as find_all_next() and find_next(), but search backwards.

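IV. CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to pick out elements here through soup.select(), which returns a list.

(1) By tag name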

print(soup.select("title"))  #[<title>The Dormouse's story</title>]

print(soup.select("b"))      #[<b>The Dormouse's story</b>]

(2) By class name

print(soup.select(".sister")) 

'''

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

'''

(3) By id

print(soup.select("#link1"))

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

(4) Combined selectors

Selectors combine exactly as they do in a CSS file. For example, to find the element with id link2 inside a <p> tag, separate the two parts with a space:

print(soup.select("p #link2"))

#[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Direct child selector:

print(soup.select("p > #link2"))

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

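(5) By attribute

A selector can also test attributes. Write the attribute in square brackets directly after the tag name; the two parts describe the same node, so there must be no space between them or nothing will match.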

print(soup.select("a[href='http://example.com/tillie']"))

#[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

select() always returns a list, so you can loop over the results and call get_text() on each one to get its text:

for title in soup.select('a'):

    print (title.get_text())

'''

Elsie

Lacie

Tillie

'''
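In practice a selector is often used to pull out an attribute rather than the text. The sketch below collects each link's href into a dictionary and uses select_one() for a single element; the shortened markup and the variable names are assumptions.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# Map each link's text to its href, using a combined child selector.
links = {a.get_text(): a["href"] for a in soup.select("p.story > a.sister")}

first = soup.select_one("#link1")   # first match only, like find()
missing = soup.select_one("#nope")  # None when nothing matches

print(links)
print(first.get_text(), missing)
```

select_one() relates to select() in the same way find() relates to find_all(): it returns a single element, or None when nothing matches.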
