Python BeautifulSoup

http://rsj217.diandian.com/post/2012-11-01/40041235132

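Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it is typically used to analyze web documents fetched by a crawler. For irregular HTML it can also fill in much of what is missing, which saves developers time and effort.

The official Beautiful Soup documentation is thorough; working through the examples it gives is enough to get comfortable with the library. It is available in both English and Chinese.

I. Installing Beautiful Soup

Installing BeautifulSoup is simple: download the BeautifulSoup source, unpack it, and run

python setup.py install

To test the installation, type import BeautifulSoup in a Python interpreter; if no exception is raised, the installation succeeded.

II. Using BeautifulSoup

1. Import BeautifulSoup and create a BeautifulSoup object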

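from BeautifulSoup import BeautifulSoup # HTML

from BeautifulSoup import BeautifulStoneSoup # XML

# import BeautifulSoup # ALL: an alternative to the two lines above; it binds the name BeautifulSoup to the module, so the class would then be BeautifulSoup.BeautifulSoup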

doc = [

'<html><head><title>Page title</title></head>',

'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',

'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',

'</html>'

]

# BeautifulSoup takes a single string argument

soup = BeautifulSoup(''.join(doc))
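For readers on Python 3, the same step can be sketched with the successor package. This is only a sketch under assumptions that are not in the original post: the beautifulsoup4 package is installed (imported as bs4), and the standard-library 'html.parser' is chosen as the parser. The variable title_text is just an illustrative name.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

doc = [
    '<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>',
    '</body></html>',
]

# Join the fragments into one string and parse it with the stdlib-backed parser.
soup = BeautifulSoup(''.join(doc), 'html.parser')

# Quick sanity check: read the text of the <title> tag.
title_text = soup.title.string
print(title_text)
```

Naming the parser explicitly keeps the result the same from machine to machine, since bs4 otherwise picks whichever parser happens to be installed.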

2. Overview of BeautifulSoup objects

When BeautifulSoup parses an HTML document, it treats the document as a tree, much like a DOM tree. The BeautifulSoup document tree has three basic kinds of object.

2.1. The soup object: BeautifulSoup.BeautifulSoup

>>> type(soup)

<class 'BeautifulSoup.BeautifulSoup'>

2.2. Tags: BeautifulSoup.Tag

>>> type(soup.html)

<class 'BeautifulSoup.Tag'>

2.3 Text: BeautifulSoup.NavigableString

>>> type(soup.title.string)

<class 'BeautifulSoup.NavigableString'>
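The three object kinds can be checked in the same way with bs4 on Python 3. This is a sketch that assumes beautifulsoup4 is installed; the import path bs4.element and the name kinds are not from the original post.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    '<html><head><title>Page title</title></head></html>', 'html.parser'
)

# Collect the class names of the soup, a tag, and a piece of text.
kinds = [
    type(soup).__name__,
    type(soup.html).__name__,
    type(soup.title.string).__name__,
]
print(kinds)
```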

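3. The BeautifulSoup parse tree

3.1 Methods of BeautifulSoup.Tag objects

Getting a Tag object

Access by tag name: append the tag name to the soup object and you get a Tag object back. This is most useful when the tag is unique in the document. Alternatively, follow the structure of the tree and select level by level.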

>>> html = soup.html

>>> html

<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>

>>> type(html)

<class 'BeautifulSoup.Tag'>

>>> title = soup.title

>>> title

<title>Page title</title>

The contents attribute

contents holds the direct children of a node in the document tree, as a list of Tag (and NavigableString) objects.

>>> soup.contents

[<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>]

>>> soup.contents[0].contents

[<head><title>Page title</title></head>, <body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>]

>>> len(soup.contents[0].contents)

2

>>> type(soup.contents[0].contents[1])

<class 'BeautifulSoup.Tag'>

Use contents to walk down the tree and parent to walk back up.
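Walking down with contents and back up with parent might look like this in bs4 on Python 3. It is a sketch that assumes beautifulsoup4 is installed; the shortened document and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

html_text = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara">one</p><p id="secondpara">two</p></body></html>'
)
soup = BeautifulSoup(html_text, 'html.parser')

# Walk down: the soup has one child (<html>), which has two (<head>, <body>).
html = soup.contents[0]
child_names = [child.name for child in html.contents]

# One level further down: the children of <body>.
body = html.contents[1]
paragraph_ids = [p['id'] for p in body.contents]

# Walk up: <title> sits inside <head>, which sits inside <html>.
parent_name = soup.title.parent.name
grandparent_name = soup.title.parent.parent.name

print(child_names, paragraph_ids, parent_name, grandparent_name)
```

Note that the sample string contains no whitespace between tags; with real pages, line breaks between tags show up in contents as extra text nodes.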

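The next attribute

next returns the object that follows in the parse tree, usually the first child; it can be either a Tag object or a NavigableString object.

>>> head = soup.head

>>> head.next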

<title>Page title</title>

>>> head.next.next

u'Page title'

>>> p1 = soup.p

>>> p1

<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

>>> p1.next

u'This is paragraph '

nextSibling is the next sibling object; it can be either a Tag object or a NavigableString object.

>>> head.nextSibling

<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>

>>> p1.next.nextSibling

<b>one</b>

The counterpart of nextSibling is previousSibling, the previous sibling node.
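In bs4 the same navigation uses snake_case names: next_element, next_sibling and previous_sibling. The sketch below assumes beautifulsoup4 on Python 3; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

soup = BeautifulSoup(
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara">This is paragraph <b>one</b>.</p></body></html>',
    'html.parser',
)

head = soup.head
first = head.next_element      # the <title> tag, parsed right after <head>
second = first.next_element    # the text inside <title>
sibling = head.next_sibling    # <body>, on the same level as <head>

text = soup.p.next_element     # the leading text of the paragraph
bold = text.next_sibling       # the <b> tag that follows that text
back = bold.previous_sibling   # back to the leading text

print(first.name, repr(second), sibling.name, bold.name)
```

next_element follows parse order and so tends to descend into children, while next_sibling stays on the same level of the tree.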

The replaceWith method

Replaces the object in the tree with something else; it accepts a string argument.

>>> head = soup.head

>>> head

<head><title>Page title</title></head>

>>> head.parent

<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>

>>> head.replaceWith('head was replaced')

>>> head

<head><title>Page title</title></head>

>>> head.parent

>>> soup

<html>head was replaced<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>
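The replacement step above can be sketched in bs4, where the method is spelled replace_with. This assumes beautifulsoup4 on Python 3; the small document and the names detached and result are illustrative.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

soup = BeautifulSoup(
    '<html><head><title>Page title</title></head>'
    '<body><p>text</p></body></html>',
    'html.parser',
)

head = soup.head
head.replace_with('head was replaced')

# The old <head> object still exists, but it is no longer part of the tree.
detached = head.parent is None
result = str(soup)
print(result)
```

The variable head keeps pointing at the detached tag, which is why printing it after the call still shows the original markup.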

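Search methods

Two search methods are provided, find and findAll. Both are available only on Tag objects and on the top-level parser object; NavigableString objects do not have them.

findAll(name, attrs, recursive, text, limit, **kwargs)

In its simplest form it takes one argument, a tag name.

Find every p tag in the document; the result behaves like a list.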

>>> soup.findAll('p')

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

>>> type(soup.findAll('p'))

<class 'BeautifulSoup.ResultSet'>

Find the p tag with id="firstpara"; the return value is a result set.

>>> pid = type(soup.findAll('p',id='firstpara'))

>>> pid

<class 'BeautifulSoup.ResultSet'>

Pass one or more attribute pairs as a dictionary

>>> p2 = soup.findAll('p',{'align':'blah'})

>>> p2

[<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

>>> type(p2)

<class 'BeautifulSoup.ResultSet'>

Using regular expressions

>>> import re

>>> soup.findAll(id=re.compile("para$"))

[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
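The search variants above can be put side by side in bs4, where findAll is spelled find_all. This is a sketch assuming beautifulsoup4 on Python 3; the variable names are illustrative, and limit is one of the parameters listed in the signature above.

```python
import re

from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

soup = BeautifulSoup(
    '<html><body>'
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>',
    'html.parser',
)

all_p = soup.find_all('p')                         # by tag name
by_id = soup.find_all('p', id='firstpara')         # by keyword attribute
by_attrs = soup.find_all('p', {'align': 'blah'})   # by attribute dictionary
by_regex = soup.find_all(id=re.compile('para$'))   # id ends with "para"
first_only = soup.find_all('p', limit=1)           # stop after one match
first_p = soup.find('p')                           # a single tag, not a list

print(len(all_p), by_id[0]['align'], by_attrs[0]['id'], first_p['id'])
```

find returns the first match directly (or None when nothing matches), which saves indexing into a result set when only one tag is expected.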

Reading and modifying attributes

>>> p1 = soup.p

>>> p1

<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

>>> p1['id']

u'firstpara'

>>> p1['id'] = 'changeid'

>>> p1

<p id="changeid" align="center">This is paragraph <b>one</b>.</p>

>>> p1['class'] = 'new class'

>>> p1

<p id="changeid" align="center" class="new class">This is paragraph <b>one</b>.</p>
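Attribute access works like a dictionary in bs4 as well. The sketch below assumes beautifulsoup4 on Python 3; the title attribute, the get and del calls, and the variable names are additions for illustration and do not appear in the original post.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

soup = BeautifulSoup(
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    'html.parser',
)
p1 = soup.p

old_id = p1['id']                 # read an attribute
p1['id'] = 'changeid'             # overwrite an existing attribute
p1['title'] = 'new attribute'     # add an attribute that was not there
missing = p1.get('lang')          # get() returns None instead of raising
del p1['align']                   # remove an attribute

print(old_id, p1.attrs)
```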

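These are the basic parse-tree methods. There are others, as well as more ways to combine searches with regular expressions; see the official documentation for details.

3.2 Methods of BeautifulSoup.NavigableString objects

NavigableString objects are simpler to work with: you mostly just read their content.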

>>> soup.title

<title>Page title</title>

>>> title = soup.title.next

>>> title

u'Page title'

>>> type(title)

<class 'BeautifulSoup.NavigableString'>

>>> title.string

u'Page title'
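In bs4 on Python 3 a NavigableString is a str subclass, so ordinary string operations work on it while it still remembers its place in the tree. This is a sketch assuming beautifulsoup4 is installed; the variable names are illustrative.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 is installed

soup = BeautifulSoup(
    '<html><head><title>Page title</title></head></html>', 'html.parser'
)

title = soup.title.string         # the text node inside <title>
is_text = isinstance(title, str)  # NavigableString behaves like a string
upper = title.upper()             # normal str methods are available
owner = title.parent.name         # it still knows which tag contains it
plain = str(title)                # a plain string with no link to the tree

print(is_text, upper, owner, plain)
```

Converting with str() is worth doing before storing the text for a long time, because a NavigableString keeps a reference to the whole parse tree.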

For how to traverse the tree in order to analyze a document, and for how to parse XML documents, see the official documentation.
