9.3.4 BeautifulSoup4

BeautifulSoup is an excellent Python extension library for extracting the data we are interested in from HTML or XML files, and it lets you specify which parser to use.

Install the module directly with pip install beautifulsoup4. Once it is installed, import it with from bs4 import BeautifulSoup.
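Because the parser is chosen by the second argument to BeautifulSoup, a script can prefer a third-party parser and fall back to the standard-library one when it is missing. The following is a minimal sketch, not part of the original text: the helper name make_soup and its preferred parameter are hypothetical, and it assumes that bs4 raises FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup and `preferred` are illustrative names, not part of bs4.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello world</p>")
print(soup.p.text)
```

Either parser finds the same p tag here, so the rest of a script does not need to know which one was used.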

The session below briefly demonstrates what BeautifulSoup4 can do. For more detailed and complete material, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

 >>> from bs4 import BeautifulSoup

 >>>

 >>> # Missing tags are added and completed automatically (the 'lxml' parser requires the lxml package)

 >>> BeautifulSoup('hello world','lxml')

 <html><body><p>hello world</p></body></html>

 >>>

 >>> # Define the content of an HTML document

 >>> html_doc = """

 <html><head><title>The Dormouse's story</title></head>

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and

 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 """

 >>>

 >>> # Parse this HTML document and display it in a neatly formatted way

 >>> soup = BeautifulSoup(html_doc,'html.parser')

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">

     Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>> # Access a specific tag

 >>> soup.title

 <title>The Dormouse's story</title>

 >>>

 >>> # Tag name

 >>> soup.title.name

 'title'

 >>>

 >>> # Tag text

 >>> soup.title.text

 "The Dormouse's story"

 >>>

 >>> # Parent tag of the title tag

 >>> soup.title.parent

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.head

 <head><title>The Dormouse's story</title></head>

 >>>

 >>> soup.b

 <b>The Dormouse's story</b>

 >>>

 >>> soup.b.name

 'b'

 >>> soup.b.text

 "The Dormouse's story"

 >>>

 >>> # The BeautifulSoup object itself can be treated as a tag object

 >>> soup.name

 '[document]'

 >>>

 >>> soup.body

 <body>

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 </body>

 >>>

 >>> soup.p

 <p class="title"><b>The Dormouse's story</b></p>

 >>>

 >>> # Tag attributes

 >>> soup.p['class']

 ['title']

 >>>

 >>> soup.p.get('class')         # Attributes can also be read this way

 ['title']

 >>>

 >>> soup.p.text

 "The Dormouse's story"

 >>>

 >>> soup.p.contents

 [<b>The Dormouse's story</b>]

 >>>

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 >>>

 >>> # View all attributes of the a tag

 >>> soup.a.attrs

 {'class': ['sister'], 'id': 'link1', 'href': 'http://example.com/elsie'}

 >>>

 >>> # Find all a tags

 >>> soup.find_all('a')

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> # Find <a> and <b> tags at the same time

 >>> soup.find_all(['a','b'])

 [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> import re

 >>> # Find tags whose href contains a given keyword

 >>> soup.find_all(href=re.compile("elsie"))

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

 >>>

 >>> soup.find(id='link3')

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

 >>>

 >>> soup.find_all('a',id='link3')

 [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 >>>

 >>> for link in soup.find_all('a'):

     print(link.text,':',link.get('href'))

 Elsie : http://example.com/elsie

 Lacie : http://example.com/lacie

 Tillie : http://example.com/tillie

 >>>

 >>> print(soup.get_text())           # Return all the text

 The Dormouse's story

 The Dormouse's story

 Once upon a time there were three little sisters;and their names were

 Elsie,

 Lacieand

 Tillie;

 and they lived at the bottom of a well.

 ...

 >>>

 >>> # Modify a tag attribute

 >>> soup.a['id']='test_link1'

 >>> soup.a

 <a class="sister" href="http://example.com/elsie" id="test_link1">Elsie</a>

 >>>

 >>> # Modify a tag's text

 >>> soup.a.string.replace_with('test_Elsie')

 'Elsie'

 >>>

 >>> soup.a.string

 'test_Elsie'

 >>>

 >>> print(soup.prettify())

 <html>

  <head>

   <title>

    The Dormouse's story

   </title>

  </head>

  <body>

   <p class="title">

    <b>

     The Dormouse's story

    </b>

   </p>

   <p class="story">

    Once upon a time there were three little sisters;and their names were

    <a class="sister" href="http://example.com/elsie" id="test_link1">

     test_Elsie

    </a>

    ,

    <a class="sister" href="http://example.com/lacie" id="link2">

     Lacie

    </a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">

     Tillie

    </a>

    ;

 and they lived at the bottom of a well.

   </p>

   <p class="story">

    ...

   </p>

  </body>

 </html>

 >>>

 >>>

 >>> # Iterate over child tags

 >>> for child in soup.body.children:

     print(child)

 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters;and their names were

 <a class="sister" href="http://example.com/elsie" id="test_link1">test_Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;

 and they lived at the bottom of a well.</p>

 <p class="story">...</p>

 >>>
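The link-listing loop from the session can also be packaged as a small reusable function. This is only a sketch under stated assumptions: the function name extract_links and the shortened sample markup are hypothetical, while find_all, get_text and get are the same BeautifulSoup calls used in the session.

```python
from bs4 import BeautifulSoup


def extract_links(markup):
    """Return a list of (link text, href) pairs for every a tag in markup.

    extract_links is an illustrative helper, not part of bs4.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # get('href') returns None instead of raising if the attribute is missing.
    return [(a.get_text(), a.get("href")) for a in soup.find_all("a")]


# Shortened, hypothetical sample based on the document used in the session.
sample = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p>"
)

for text, href in extract_links(sample):
    print(text, ":", href)
```

Returning pairs rather than printing inside the function keeps the parsing step separate from how the results are displayed or stored.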
