Quick start with BeautifulSoup

First, create the HTML document we want to parse. This example uses the one from the official documentation:

html_doc = """

<html><head><title>The Dormouse's story</title></head>

<body>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.</p>

<p class="story">...</p>

"""

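To parse this document, import BeautifulSoup. You can optionally print the parsed result with standard indentation:

from bs4 import BeautifulSoup  # import the BeautifulSoup class

# You can pass in a string or an open file handle. A common workflow is to fetch the page with the requests library and then parse it with BeautifulSoup.

soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly: the built-in 'html.parser' always works, and 'lxml' is faster if it is installed

print(soup.prettify())  # print the parsed document with standard indentation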

<html>

 <head>

  <title>

   The Dormouse's story

  </title>

 </head>

 <body>

  <p class="title">

   <b>

    The Dormouse's story

   </b>

  </p>

  <p class="story">

   Once upon a time there were three little sisters; and their names were

   <a class="sister" href="http://example.com/elsie" id="link1">

    Elsie

   </a>

   ,

   <a class="sister" href="http://example.com/lacie" id="link2">

    Lacie

   </a>

   and

   <a class="sister" href="http://example.com/tillie" id="link3">

    Tillie

   </a>

   ;

and they lived at the bottom of a well.

  </p>

  <p class="story">

   ...

  </p>

 </body>

</html>
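
The earlier comment says the constructor accepts either a string or a file handle. A minimal sketch of both follows, assuming bs4 is installed; io.StringIO stands in for a real open file, and the variable names are made up for illustration.

```python
import io

from bs4 import BeautifulSoup

markup = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Parse directly from a string.
soup_from_string = BeautifulSoup(markup, 'html.parser')

# Parse from a file-like object; io.StringIO stands in for open('page.html').
with io.StringIO(markup) as handle:
    soup_from_file = BeautifulSoup(handle, 'html.parser')

title_from_string = soup_from_string.title.string
title_from_file = soup_from_file.title.string
print(title_from_string == title_from_file)
```

Both calls build the same tree, so which one to use only depends on where the markup comes from.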

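# A few simple ways to navigate the parsed data:

print(soup.title)  # the document's <title> tag

print(soup.title.name)  # the name of that tag, 'title'

print(soup.title.string)  # the text inside <title>

print(soup.title.parent.name)  # the name of the tag one level up, i.e. 'head'

print(soup.p)  # the first <p> tag in the document

print(soup.p['class'])  # the class attribute of the first <p> tag

print(soup.a)  # the first <a> tag in the document

print(soup.find_all('a'))  # all <a> tags in the document, returned as a list

print(soup.find(id='link3'))  # the tag whose id attribute is 'link3'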

<title>The Dormouse's story</title>

title

The Dormouse's story

head

<p class="title"><b>The Dormouse's story</b></p>

['title']

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):

    print(link.get('href'))  # the href attribute of each <a> tag

http://example.com/elsie

http://example.com/lacie

http://example.com/tillie

#print(soup.get_text())

print(soup.text)  # soup.text and soup.get_text() both return all the text in the document

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie,

Lacie and

Tillie;

and they lived at the bottom of a well.

...
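
The raw text above keeps all the whitespace from the source. As a hedged sketch, get_text() also takes a separator and a strip flag, which is often handier for scraped pages. The shortened markup and the variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

markup = ('<p class="story">Once upon a time '
          '<a href="http://example.com/elsie">Elsie</a>, '
          '<a href="http://example.com/lacie">Lacie</a></p>')
text_soup = BeautifulSoup(markup, 'html.parser')

# Plain concatenation of every string in the document.
plain = text_soup.get_text()

# Strip each string and join the non-empty pieces with a separator.
joined = text_soup.get_text('|', strip=True)

print(plain)
print(joined)
```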

Kinds of objects

Beautiful Soup turns an HTML document into a tree of nodes. These nodes fall into four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
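
To make the four kinds concrete, the sketch below parses a tiny document and checks the type of each node. The markup and the name demo are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

demo = BeautifulSoup('<b><!--Hey, buddy--></b><i>text</i>', 'html.parser')

kind_of_document = type(demo).__name__               # the whole document
is_tag = isinstance(demo.b, Tag)                     # <b> is a Tag
is_string = isinstance(demo.i.string, NavigableString)  # text inside <i>
is_comment = isinstance(demo.b.string, Comment)      # the comment inside <b>

print(kind_of_document, is_tag, is_string, is_comment)
```

Comment is a subclass of NavigableString, so a comment also passes an isinstance check against NavigableString.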

Tag

A Tag object corresponds to a tag in the HTML document.

The two most important features of a Tag are its name and its attributes.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','html.parser')

tag = soup.b

type(tag)

bs4.element.Tag

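# Name

# Every tag has a name, available as .name

# If you change a tag's name, the change shows up in any HTML generated from the current BeautifulSoup object.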

print(tag.name)

tag.name='blockquote'

print(tag)

b

<blockquote class="boldest">Extremely bold</blockquote>

# Attributes

# Attributes can be added, modified, and deleted, just like dictionary entries

tag['class'] = 'verybold'

tag['id'] = 1

# Read a single attribute with tag['attr_name'],

# or use tag.attrs to get a dictionary of all attributes.

print(tag['class'])

print(tag.attrs)

print(tag)

# Delete an attribute with the del statement

del tag['id']

verybold

{'class': 'verybold', 'id': 1}

<blockquote class="verybold" id="1">Extremely bold</blockquote>

# Some attributes, such as class, can hold several values; Beautiful Soup returns these as a list

css_soup = BeautifulSoup('<p class="body strikeout"></p>','html.parser')

print(css_soup.p['class'])

['body', 'strikeout']

# When a tag is converted back to a string, a multi-valued attribute is joined into a single value:

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>','html.parser')

rel_soup.a['rel'] = ['index', 'contents']

print(rel_soup.p)

# Documents parsed as XML have no multi-valued attributes.

<p>Back to the <a rel="index contents">homepage</a></p>
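
A short sketch of the different ways to read attributes, assuming a made-up one-tag document. It contrasts a multi-valued attribute (class) with a single-valued one (id), and shows get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<p id="my id" class="body strikeout"></p>', 'html.parser')
p = attr_soup.p

class_value = p.get('class')             # multi-valued, so a list
id_value = p['id']                       # single-valued, so a plain string
id_as_list = p.get_attribute_list('id')  # always a list, whatever the attribute
missing = p.get('title')                 # missing attribute gives None

print(class_value, id_value, id_as_list, missing)
```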

NavigableString

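A navigable string.

Strings are usually contained in tags; Beautiful Soup uses the NavigableString class to wrap the text inside a tag.

print(type(tag.string))  # tag.string is the tag's text content, i.e. the string in <tag>string</tag>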

<class 'bs4.element.NavigableString'>

# The string in a tag cannot be edited in place, but it can be replaced with another string

tag.string.replace_with('No longer bold')

print(tag)

<blockquote class="verybold">No longer bold</blockquote>
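
A NavigableString behaves like a normal Python string but also remembers its place in the tree. The sketch below, with an illustrative one-tag document, shows that link back to the tree and how to turn the value into a plain str with str().

```python
from bs4 import BeautifulSoup

ns_soup = BeautifulSoup('<b>Extremely bold</b>', 'html.parser')

text = ns_soup.b.string      # a NavigableString
parent_name = text.parent.name  # it still knows which tag contains it
plain = str(text)            # a plain str with no link to the tree

print(type(text).__name__, parent_name, type(plain).__name__)
```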

Navigating the document tree

We reuse html_doc from earlier to show how to move from one part of the document to another.

soup=BeautifulSoup(html_doc,'html.parser')

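Child nodes

A tag may contain strings and other tags; all of these are the tag's children.

# To get the first tag with a given name, use .tag_name

print(soup.a)

# To get all tags with a given name, use find_all('tag_name')

print(soup.find_all('a'))

# A tag's .contents attribute returns its children as a list

head_tag=soup.head

print(head_tag.contents)

# A tag's .children generator iterates over its direct children only

for child in head_tag.children:

    print(child)

# The .descendants attribute iterates recursively over all of a tag's descendants:

for child in head_tag.descendants:

    print(child)

# .string: if a tag has exactly one child and that child is a NavigableString, .string returns it.

# If a tag has more than one child, it is unclear which one .string should refer to, so .string is None.

# If a tag contains several strings, iterate over them with .strings; the results may include extra spaces and blank lines.

# .stripped_strings removes that extra whitespace.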

for string in soup.stripped_strings:

    print(repr(string))
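
The sketch below runs the child-related attributes on a small fragment so the differences are easy to see. The fragment and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

child_soup = BeautifulSoup('<p><b>one</b> and <i>two</i></p>', 'html.parser')
p = child_soup.p

direct_children = p.contents               # <b>, the string ' and ', <i>
descendant_count = len(list(p.descendants))  # tags and strings at every depth
single_string = p.b.string                 # <b> has exactly one string child
ambiguous_string = p.string                # <p> has several children
all_strings = list(p.strings)
clean_strings = list(p.stripped_strings)

print(direct_children, descendant_count)
print(single_string, ambiguous_string)
print(all_strings, clean_strings)
```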

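Parent nodes

Every tag and every string has a parent: the tag that contains it.

# The .parent attribute gives an element's parent, for example:

title_tag=soup.title

print(title_tag.parent)

# The string inside <title> also has a parent: the <title> tag itself

# .parents iterates over all of an element's ancestors, up to the root of the tree.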

<head><title>The Dormouse's story</title></head>
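
A minimal sketch of .parent and .parents on a small, made-up document. The last ancestor is the BeautifulSoup object itself, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

parent_soup = BeautifulSoup(
    '<html><body><p><a href="#">link</a></p></body></html>', 'html.parser')
link = parent_soup.a

string_parent = link.string.parent.name            # the string's parent is <a>
ancestor_names = [parent.name for parent in link.parents]
root_parent = parent_soup.parent                   # the document has no parent

print(string_parent, ancestor_names, root_parent)
```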

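Sibling nodes

When a document is pretty-printed, sibling nodes appear at the same indentation level.

The .next_sibling and .previous_sibling attributes return the node immediately after or before the current one at the same level.

The .next_siblings and .previous_siblings attributes iterate over all following or preceding siblings of the current node.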
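
The sketch below shows the sibling attributes on two small fragments (both made up for illustration). The second fragment highlights a common surprise: in real documents the next sibling of a tag is often a string such as a comma or a newline, not the next tag.

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup('<a><b>text1</b><c>text2</c></a>', 'html.parser')
after_b = sibling_soup.b.next_sibling        # <c>text2</c>
before_c = sibling_soup.c.previous_sibling   # <b>text1</b>
before_b = sibling_soup.b.previous_sibling   # nothing comes before <b>

list_soup = BeautifulSoup('<p><a>Elsie</a>, <a>Lacie</a></p>', 'html.parser')
first_link = list_soup.a
text_sibling = first_link.next_sibling       # the string ', ', not a tag
following = list(first_link.next_siblings)   # the string, then the second <a>

print(after_b, before_c, before_b)
print(repr(text_sibling), following)
```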

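Going back and forth

The .next_element and .previous_element attributes point to whatever was parsed immediately after or immediately before the current object.

The .next_elements and .previous_elements attributes are iterators (not lists) over everything parsed after or before the current object.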
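
.next_element follows parse order, so it is not the same as .next_sibling. A hedged sketch on a made-up fragment: the element parsed right after the opening tag is the text inside it, while the next sibling is the text that follows the closing tag.

```python
from bs4 import BeautifulSoup

elem_soup = BeautifulSoup('<p><a>Tillie</a>; and they lived</p>', 'html.parser')
link = elem_soup.a

next_sibling = link.next_sibling      # text after the closing </a>
next_element = link.next_element      # text inside <a>, parsed right after the opening tag
remaining = list(link.next_elements)  # everything parsed after <a>, in order

print(repr(next_sibling), repr(next_element), remaining)
```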

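Searching the document tree

The filters below select tag nodes in the parsed document.

# 1. A string

soup.find_all('b')  # find all <b> tags

# 2. A regular expression

import re

for tag in soup.find_all(re.compile('^b')):

    print(tag.name)

# 3. A list

soup.find_all(['a','b'])  # find all <a> and <b> tags

# 4. True matches every tag, but not text strings

# 5. If none of the other filters fits, define a function that takes an element as its only argument and returns True when the element matches

def has_class_but_no_id(tag):

    return tag.has_attr('class') and not tag.has_attr('id')

# Pass the function to find_all()

soup.find_all(has_class_but_no_id)

# When a function is used to filter on one specific attribute, its argument is the value of that attribute, not the whole tag.

def not_lacie(href):

    return href and not re.compile('lacie').search(href)

soup.find_all(href=not_lacie)  # find tags whose href does not match the regular expression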
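
To see what each kind of filter returns, the sketch below applies them to a trimmed copy of html_doc. The shortened markup and the name filter_soup are assumptions made so the example runs on its own.

```python
import re

from bs4 import BeautifulSoup

markup = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p></body></html>"""
filter_soup = BeautifulSoup(markup, 'html.parser')


def has_class_but_no_id(tag):
    # Receives a whole Tag object.
    return tag.has_attr('class') and not tag.has_attr('id')


def not_lacie(href):
    # Receives only the value of the href attribute (None if it is absent).
    return href and not re.compile('lacie').search(href)


regex_names = [t.name for t in filter_soup.find_all(re.compile('^b'))]
list_names = [t.name for t in filter_soup.find_all(['a', 'b'])]
all_names = [t.name for t in filter_soup.find_all(True)]
function_classes = [t['class'] for t in filter_soup.find_all(has_class_but_no_id)]
kept_ids = [t['id'] for t in filter_soup.find_all(href=not_lacie)]

print(regex_names, list_names, all_names, function_classes, kept_ids)
```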

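The find_all() method

find_all(name, attrs, recursive, string, limit, **kwargs)

Searches all descendants of the current tag and returns those that match the filters.

# The name argument: find all tags with the given name; text strings are ignored.

soup.find_all('title')

# Keyword arguments: an argument whose name is not one of find_all()'s built-in parameters is treated as a filter on the tag attribute of that name.

soup.find_all(id='link2')

soup.find_all(href=re.compile('elsie'))

# Several keyword arguments can be combined to filter on multiple attributes at once:

soup.find_all(href=re.compile('elsie'),id='link1')

# Some attribute names cannot be used as keyword arguments, such as HTML5 data-* attributes, but you can pass them in a dictionary through the attrs argument:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>','html.parser')

data_soup.find_all(attrs={'data-foo':'value'})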

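Searching by CSS class

# As of Beautiful Soup 4.1.2, you can search for tags with a given CSS class using the class_ keyword argument:

soup.find_all('a',class_='sister')

# class_ accepts the same kinds of filters as the other arguments, such as a regular expression

soup.find_all(class_=re.compile('itl'))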

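The string argument

soup.find_all(string='Elsie')  # find text nodes whose content is exactly 'Elsie'

The limit argument

Limits the number of results returned; the search stops once that many matches have been found.

The recursive argument

To search only a tag's direct children, pass recursive=False.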
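
A short sketch of the string, limit, and recursive arguments on a small, made-up document. Note that <title> is a grandchild of <html>, so it is not found when recursive=False.

```python
from bs4 import BeautifulSoup

markup = ('<html><head><title>T</title></head>'
          '<body><a>1</a><a>2</a><a>3</a></body></html>')
arg_soup = BeautifulSoup(markup, 'html.parser')

# string: match text nodes instead of tags.
matching_strings = arg_soup.find_all(string='2')

# limit: stop after the first two matches.
first_two = [a.string for a in arg_soup.find_all('a', limit=2)]

# recursive: look at all descendants (default) or direct children only.
deep = arg_soup.html.find_all('title')
shallow = arg_soup.html.find_all('title', recursive=False)

print(matching_strings, first_two, len(deep), len(shallow))
```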

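Calling a tag is like calling find_all()

A BeautifulSoup object or a Tag object can be called like a function; doing so is the same as calling find_all() on it.

soup.find_all('a')

soup('a')  # these two lines are equivalent

The find() method

Used the same way as find_all(), but returns only the first match, or None if nothing matches.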
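
The difference in return values matters when nothing matches: find() gives None, while find_all() gives an empty list. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

find_soup = BeautifulSoup('<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>',
                          'html.parser')

first_link = find_soup.find('a')                    # first match only
same_link = find_soup.find_all('a', limit=1)[0]     # equivalent lookup
no_match = find_soup.find('table')                  # None when nothing matches
no_matches = find_soup.find_all('table')            # empty list when nothing matches
called = find_soup('a')                             # same as find_soup.find_all('a')

print(first_link, no_match, no_matches, len(called))
```

Because find() can return None, check the result before accessing attributes on it.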

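CSS selectors: the select() method

soup.select('title')  # select the <title> tag

soup.select('p:nth-of-type(3)')  # select each <p> that is the third <p> among its siblings

# Find tags anywhere beneath other tags

soup.select('body a')  # find <a> tags beneath <body>

# Find tags that are direct children of another tag:

soup.select('head>title')

# Find by id:

soup.select('#link1')

# Find by class:

soup.select('.sister')

soup.select('[class~=sister]')

# Find by the presence of an attribute:

soup.select('a[href]')

# Find by attribute value:

soup.select('a[href="http://www.baidu.com"]')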
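
The sketch below runs a few of these selectors on a trimmed copy of html_doc and adds select_one(), which returns only the first match. The shortened markup and the name css_demo are assumptions, and recent versions of bs4 need the soupsieve package for select(), which is normally installed alongside it.

```python
from bs4 import BeautifulSoup

markup = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p></body></html>"""
css_demo = BeautifulSoup(markup, 'html.parser')

second_p = css_demo.select('p:nth-of-type(2)')    # the second <p> in <body>
title = css_demo.select('head > title')           # direct child of <head>
sisters = css_demo.select('.sister')              # by class
with_href = css_demo.select('a[href]')            # by attribute presence
elsie = css_demo.select('a[href="http://example.com/elsie"]')  # by attribute value
lacie = css_demo.select_one('#link2')             # first match only, by id

print(second_p[0]['class'], title[0].string, len(sisters), len(with_href))
print(elsie[0].string, lacie.string)
```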

