Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时甚至数天的工作时间.

快速开始

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

　　使用BeautifulSoup解析这段代码,能够得到一个 BeautifulSoup 的对象,并能按照标准的缩进格式的结构输出:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

print(soup.prettify())

# <html>

#  <head>

#   <title>

#    The Dormouse's story

#   </title>

#  </head>

#  <body>

#   <p class="title">

#    <b>

#     The Dormouse's story

#    </b>

#   </p>

#   <p class="story">

#    Once upon a time there were three little sisters; and their names were

#    <a class="sister" href="http://example.com/elsie" id="link1">

#     Elsie

#    </a>

#    ,

#    <a class="sister" href="http://example.com/lacie" id="link2">

#     Lacie

#    </a>

#    and

#    <a class="sister" href="http://example.com/tillie" id="link2">

#     Tillie

#    </a>

#    ; and they lived at the bottom of a well.

#   </p>

#   <p class="story">

#    ...

#   </p>

#  </body>

# </html>

　　几个简单的浏览结构化数据的方法：

soup.title

# <title>The Dormouse's story</title>

soup.title.name

# u'title'

soup.title.string

# u'The Dormouse's story'

soup.title.parent.name

# u'head'

soup.p

# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']

# u'title'

soup.a

# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")

# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

　　从文档中找到所有<a>标签的链接:

for link in soup.find_all('a'):

    print(link.get('href'))

    # http://example.com/elsie

    # http://example.com/lacie

    # http://example.com/tillie

　　从文档中获取所有文字内容:

print(soup.get_text())

# The Dormouse's story

#

# The Dormouse's story

#

# Once upon a time there were three little sisters; and their names were

# Elsie,

# Lacie and

# Tillie;

# and they lived at the bottom of a well.

#

# ...

如何使用

将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象, 可以传入一段字符串或一个文件句柄.

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")

　　首先,文档被转换成Unicode,并且HTML的实例都被转换成Unicode编码

BeautifulSoup("Sacré bleu!")

<html><head></head><body>Sacré bleu!</body></html>

　　然后,Beautiful Soup选择最合适的解析器来解析这段文档,如果手动指定解析器那么Beautiful Soup会选择指定的解析器来解析文档.

对象的种类

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: Tag , NavigableString , BeautifulSoup , Comment .

Tag

Tag 对象与XML或HTML原生文档中的tag相同:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')

tag = soup.b

type(tag)

# <class 'bs4.element.Tag'>

　　Tag有很多方法和属性,在遍历文档树和搜索文档树中有详细解释.现在介绍一下tag中最重要的属性: name和attributes.

Name

每个tag都有自己的名字,通过 .name 来获取:

tag.name

# u'b'

如果改变了tag的name,那将影响所有通过当前Beautiful Soup对象生成的HTML文档:

tag.name = "blockquote"

tag

# <blockquote class="boldest">Extremely bold</blockquote>

Attributes

一个tag可能有很多个属性. tag <b class="boldest"> 有一个 “class” 的属性,值为 “boldest” . tag的属性的操作方法与字典相同:

tag['class']

# u'boldest'

　　也可以直接”点”取属性, 比如: .attrs :

tag.attrs

# {u'class': u'boldest'}

　　tag的属性可以被添加,删除或修改. 再说一次, tag的属性操作方法与字典一样

tag['class'] = 'verybold'

tag['id'] = 1

tag

# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']

del tag['id']

tag

# <blockquote>Extremely bold</blockquote>

tag['class']

# KeyError: 'class'

print(tag.get('class'))

Beautiful Soup 4.2.0 文档（一）的更多相关文章

Beautiful Soup 4.2.0 文档
Beautiful Soup 4.2.0 文档 Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方 ...
吴裕雄--天生自然python学习笔记：Beautiful Soup 4.2.0模块
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式.Beautiful Soup会帮你节省数小时 ...
Beautiful Soup 4.2.0 doc_tag、Name、Attributes、多值属性
找到了bs4的中文文档,对昨天爬虫程序里所涉及的bs4库进行学习.这篇代码涉及到tag.Name.Attributes以及多值属性. ''' 对象的种类 Beautiful Soup将复杂HTML文档 ...
Beautiful Soup 4.4.0 基本使用方法
Beautiful Soup 4.4.0 基本使用方法Beautiful Soup 安装 pip install beautifulsoup4 标准库有html.parser解析器但速度不是很快一般 ...
vue mand-mobile按2.0文档默认安装的是1.6.8版本
vue mand-mobile按2.0文档默认安装的是1.6.8版本 npm list mand-mobilebigbullmobile@1.0.0 E:\webcode\bigbullmobile` ...
css2.0文档查阅及字体样式
css2.0文档查阅下载网址:http://soft.hao123.com/soft/appid/9517.html <html xmlns="http://www.w3.o ...
Beautiful Soup 4.2.0
Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库.它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式快速开始 pip install beaut ...
Django2.0文档
第四章模板 1.标签 (1)if/else {% if %} 标签检查(evaluate)一个变量,如果这个变量为真(即,变量存在,非空,不是布尔值假),系统会显示在 {% if %} 和 {% e ...
11.Django2.0文档
第四章模板 1.标签 (1)if/else {% if %} 标签检查(evaluate)一个变量,如果这个变量为真(即,变量存在,非空,不是布尔值假),系统会显示在 {% if %} 和 {% e ...

随机推荐

length属性、length()方法和size()的方法的区别
JAVA 1. length属性是针对Java中的数组来说的,要求数组的长度可以用其length属性: 2.length()方法是针对字符串来说的,要求一个字符串的长度就要用到它的length()方法 ...
go 学习笔记之学习函数式编程前不要忘了函数基础
在编程世界中向来就没有一家独大的编程风格,至少目前还是百家争鸣的春秋战国,除了众所周知的面向对象编程还有日渐流行的函数式编程,当然这也是本系列文章的重点. 越来越多的主流语言在设计的时候几乎无一例外都 ...
FreeSql （十五）查询数据
FreeSql在查询数据下足了功能,链式查询语法.多表查询.表达式函数支持得非常到位. IFreeSql fsql = new FreeSql.FreeSqlBuilder() .UseConnect ...
CSAPP DataLab
注意不同版本的题目可能会有所不同,搜了很多他们的题目和现在官网给的实验题都不一样,自己独立思考完整做一遍顺便记录一下. PS:这些难度为1的题有的说实话我都做了挺久的,不过完整做一遍感觉很有意思,这些 ...
gallery的简单使用方法
Gallery意思是"画廊",就是一个水平可左右拖动的列表,里面可以放置多个图片,经常和ImageSwitcher一起使用在使用ImageSwitcher时,需要传入一个View ...
YQL获取天气
$(function () { $.getJSON("http://query.yahooapis.com/v1/public/yql?callback=?", { q: &quo ...
jdk1.8源码阅读
一.java.lang java的基础类 1.object 所有类的爸爸 registerNatives() Class<?> getClass():返回运行时的类 int hashCod ...
Linux初识之Centos7中terminal光标位置偏移问题的解决
新安装的centos7打开terminal发现光标位置向右偏移,使用起来影响感官,经查询后找到类似情况并顺利解决问题,特记录解决过程以作参考. 1.未解决时光标向右偏移显示: 2.打开设置(Setti ...
selenium-01-2环境搭建
首先下载好Eclipse 和配置好Java 环境变量步骤省略, 请百度方法一添加jar包官方下载地址: http://www.seleniumhq.org/download/ 官方地址 ...
redis常用操作-键的生存时间
System.out.println("设置 key001的过期时间为5秒:"+jedis.expire("key001", 5)); System.out.p ...

Beautiful Soup 4.2.0 文档（一）