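Web Scraping: The BeautifulSoup4 Parser
BeautifulSoup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and the HTML and XML parsers provided by lxml.
Compared with regular expressions, it is easier to use.
Example:
First, import the bs4 library
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """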
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Pretty-print the contents of the soup object
print(soup.prettify())
Output
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
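Notice that the input string never closes its body and html tags, yet the printed tree does. The sketch below shows the same kind of repair on a smaller, made-up fragment. It assumes the standard-library "html.parser" instead of lxml so that it runs without installing lxml; the exact repaired markup can differ between parsers.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup (illustrative): <b> and <p> are never closed.
broken = "<p>unclosed <b>bold"

# Assumption: "html.parser" is used here only because it ships with Python.
soup = BeautifulSoup(broken, "html.parser")

# Serializing the tree writes out the closing tags the input was missing.
repaired = str(soup)
print(repaired)
```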
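The Four Kinds of Objects
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds:
- Tag
- NavigableString
- BeautifulSoup
- Comment
1. Tag
Put simply, a Tag corresponds to a tag in the HTML, for example: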
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The HTML tags above (title, head, a, p and so on), together with the content they enclose, are Tags. Let's use BeautifulSoup to get some Tags:
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the title tag
print(soup.title)
# Print the head tag
print(soup.head)
# Print the a tag
print(soup.a)
# Print the p tag
print(soup.p)
# Print the type of soup.p
print(type(soup.p))
Output
<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<class 'bs4.element.Tag'>
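We can easily get these tags by writing soup followed by the tag name; the resulting objects are of type bs4.element.Tag. Note, however, that this only returns the first matching tag in the whole document. Finding all matching tags is covered later.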
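As a small illustration of the "first match only" behaviour, the sketch below compares attribute access with an explicit find() call. The two-link markup is a trimmed, made-up variant of the sample document, and "html.parser" is assumed so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup (illustrative only).
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a         # attribute access returns the first <a> only
same_link = soup.find("a")  # the explicit call that does the same lookup
missing = soup.table        # there is no <table>, so this is None

print(first_link["id"])
print(first_link is same_link)
print(missing)
```

Because a missing tag yields None rather than an error, it is worth checking the result before chaining further attribute access onto it.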
A Tag has two important attributes: name and attrs
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
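<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# The soup object is special: its name is [document]
print(soup.name)
# For other tags, the value is the name of the tag itself
print(soup.head.name)
# Print all attributes of the p tag; the result is a dict
print(soup.p.attrs)
# Print the class attribute of the p tag
print(soup.p['class'])
# The get method also looks up an attribute by name; for an attribute
# that exists, it gives the same result as the line above
print(soup.p.get('class'))

print(soup.p)
# Modify an attribute
soup.p['class'] = "newClass"
print(soup.p)
# Delete an attribute
del soup.p['class']
print(soup.p)
Output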
[document]
head
{'class': ['title'], 'name': 'dromouse'}
['title']
['title']
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
<p name="dromouse"><b>The Dormouse's story</b></p>
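The output above shows class as a list while name is a plain string. A minimal sketch of that difference, and of how ['...'] and get() behave for a missing attribute, follows. The one-line markup is made up for illustration and "html.parser" is an assumption to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Illustrative markup with a two-valued class attribute.
soup = BeautifulSoup('<p class="title big" id="intro">Hello</p>', "html.parser")
p = soup.p

classes = p["class"]            # class can hold several values, so it is a list
pid = p["id"]                   # id holds a single value, so it is a str
missing = p.get("style")        # get() returns None for a missing attribute
fallback = p.get("style", "")   # or a default of your choice

try:
    p["style"]                  # subscript access raises for a missing attribute
    raised = False
except KeyError:
    raised = True

print(classes, pid, missing, repr(fallback), raised)
```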
2. NavigableString
Now that we have the tag itself, how do we get the text inside it? Simply use .string, for example:
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Print the text inside the p tag
print(soup.p.string)
# Print the type of soup.p.string
print(type(soup.p.string))
Output
The Dormouse's story
<class 'bs4.element.NavigableString'>
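A NavigableString behaves like an ordinary Python string but also remembers where it sits in the tree. The sketch below, which assumes "html.parser" and a one-tag document for brevity, shows both sides and how to convert to a plain str when the tree reference is not wanted.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>The Dormouse's story</b>", "html.parser")

text_node = soup.b.string

is_str = isinstance(text_node, str)   # NavigableString subclasses str
parent_name = text_node.parent.name   # it still knows its parent tag
plain = str(text_node)                # plain str, detached from the tree

print(is_str, parent_name, type(plain).__name__)
```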
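3. BeautifulSoup
A BeautifulSoup object represents the content of a whole document. Most of the time you can treat it as a special kind of Tag. Below we print the type of its name, the name itself, and its attributes
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """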
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# Type of the name attribute
print(type(soup.name))
# Name
print(soup.name)
# Attributes
print(soup.attrs)
Output
<class 'str'>
[document]
{}
4. Comment
A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.a)
print(soup.a.string)
print(type(soup.a.string))
Output
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
Elsie
<class 'bs4.element.Comment'>
The content of the a tag is actually a comment, but when we print it with .string the comment delimiters have already been removed.
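Since a comment printed through .string looks just like ordinary text, it can help to check the type before using the value. The helper below is a hypothetical sketch (visible_string is not part of bs4), run on trimmed markup with "html.parser" assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Trimmed sample markup (illustrative only).
html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html, "html.parser")


def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is an HTML comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


results = [visible_string(a) for a in soup.find_all("a")]
print(results)
```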
Traversing the Document Tree
1. Direct children: the .contents and .children attributes
.contents
A Tag's .contents attribute returns the tag's direct children as a list
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# The result is a list
print(soup.head.contents)
print(soup.head.contents[0])
Output
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
.children
It does not return a list, but we can iterate over it to get all the direct children.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# This prints an iterator object, not a list
print(soup.head.children)
# Iterate to get all the direct children
for child in soup.head.children:
    print(child)
Output
<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
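The head tag has only one child, which hides the fact that direct children include text nodes as well as tags. The sketch below uses a made-up paragraph with mixed content and keeps only the Tag children; "html.parser" is assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative markup: text and tags interleaved inside one <p>.
html = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and more.</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .contents is a list holding both text nodes and tags.
all_children = p.contents

# .children yields the same nodes lazily; keep only the tags.
child_tags = [child for child in p.children if isinstance(child, Tag)]
ids = [tag["id"] for tag in child_tags]

print(len(all_children), ids)
```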
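2. All descendants: the .descendants attribute
The .contents and .children attributes described above contain only a Tag's direct children. The .descendants attribute iterates recursively over all of a Tag's descendants. As with .children, we need to loop over it to get its contents.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """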
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# This prints a generator object
print(soup.head.descendants)
# Iterate to get all descendants
for child in soup.head.descendants:
    print(child)
Output
<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
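To make the difference between direct children and descendants concrete, the sketch below counts both for a small nested fragment. The markup is made up for illustration and "html.parser" is an assumption.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>bold</b> text</p></div>", "html.parser")
div = soup.div

# Only the <p> is a direct child of the <div>.
direct = list(div.children)

# Descendants are visited depth-first: <p>, <b>, "bold", " text".
nested = list(div.descendants)

print(len(direct), len(nested))
```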
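3. Node content: the .string attribute
If a Tag has only one child and that child is a NavigableString, .string returns that child. If a Tag has only one child and that child is another Tag, .string can also be used, and the result is the same as the .string of that single child. If a Tag has more than one child, .string is None.
Put simply: if a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string also returns the text inside that child. For example:
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """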
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.head.string)
print(soup.head.title.string)
Output
The Dormouse's story
The Dormouse's story
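The rule above also has a flip side: when a tag has several children, .string has no single node to return. The sketch below shows that case on a made-up paragraph, together with get_text() and .stripped_strings as alternatives; "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Illustrative markup: a <p> with several children.
html = "<p>Once upon a time <a>Lacie</a> and <a>Tillie</a>.</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

single = p.string                    # more than one child, so this is None
full_text = p.get_text()             # all text joined into one string
pieces = list(p.stripped_strings)    # each text piece, whitespace stripped

print(single)
print(full_text)
print(pieces)
```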
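Searching the Document Tree
1. find_all(name, attrs, recursive, text, **kwargs)
1) The name parameter
The name parameter finds all Tags whose name is name; text nodes are ignored automatically
a. Passing a string
The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for every tag whose name exactly matches that string and returns the matches in a list.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """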
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all("b"))
print(soup.find_all("a"))
Output
[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
b. Passing a regular expression
If you pass a compiled regular expression, Beautiful Soup filters tag names using the pattern's search() method
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import re

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
Output
body
b
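The ^ anchor matters in the example above. A short sketch comparing an anchored and an unanchored pattern on a made-up document is shown below; "html.parser" is assumed so that no extra tags are added around the input.

```python
import re

from bs4 import BeautifulSoup

# Illustrative document with five tags: html, head, title, body, b.
html = "<html><head><title>t</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains a "t" somewhere.
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)
print(unanchored)
```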
c. Passing a list
If you pass a list, Beautiful Soup returns, as a list, every tag that matches any item in the list
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
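<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all(['a', 'b']))
2) Keyword arguments
A keyword argument that is not one of find_all()'s own parameter names is treated as a filter on the tag attribute with that name
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """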
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.find_all(id="link1"))
Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
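Some attribute names cannot be passed directly as keywords: class is a reserved word in Python, and name collides with find_all()'s first parameter. The sketch below shows the usual workarounds (class_ and the attrs dict) on trimmed sample markup, with "html.parser" assumed.

```python
import re

from bs4 import BeautifulSoup

# Trimmed sample markup (illustrative only).
html = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class is a Python keyword, so bs4 accepts class_ instead.
by_class = soup.find_all("a", class_="sister")

# Keyword values can also be regular expressions.
by_href = soup.find_all(href=re.compile("lacie"))

# An attribute called "name" must go through the attrs dict.
by_name_attr = soup.find_all(attrs={"name": "dromouse"})

print(len(by_class), by_href[0]["id"], by_name_attr[0].name)
```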
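3) The text parameter
The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, or a list. Since Beautiful Soup 4.4.0 this parameter is called string; text is the older name.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import re

from bs4 import BeautifulSoup

html = """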
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# String
print(soup.find_all(text=" Elsie "))
# List
print(soup.find_all(text=["Tillie", " Elsie ", "Lacie"]))
# Regular expression
print(soup.find_all(text=re.compile("Dormouse")))
Output
[' Elsie ']
[' Elsie ', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
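The results above are text nodes, not tags. As a sketch of the same search written with the newer string keyword (available since Beautiful Soup 4.4.0), and of combining it with a tag name to get tags back instead, consider the trimmed markup below; "html.parser" is assumed.

```python
import re

from bs4 import BeautifulSoup

# Trimmed sample markup (illustrative only).
html = (
    "<title>The Dormouse's story</title>"
    "<p><b>The Dormouse's story</b></p>"
    '<a id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# With only string=..., the matches are the text nodes themselves.
strings = soup.find_all(string=re.compile("Dormouse"))

# With a tag name as well, the matches are tags whose .string fits.
tags = soup.find_all("a", string="Lacie")

print(strings)
print(tags)
```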
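CSS Selectors
This is another way of searching that serves much the same purpose as find_all()
- In CSS, a tag name is written with no prefix, a class name is prefixed with ., and an id is prefixed with #
- We can filter elements here in a similar way, using the soup.select() method, which returns a list
(1) Searching by tag name
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """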
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("title"))
print(soup.select("b"))
print(soup.select("a"))
Output
[<title>The Dormouse's story</title>]
[<b>The Dormouse's story</b>]
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
(2) Searching by class name
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select(".title"))
Output
[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>]
(3) Searching by id
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("#link1"))
Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
(4) Combined selectors
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p #link1"))
Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
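(5) Searching by attribute
Attribute conditions can also be added to a selector. The attribute goes in square brackets. Note that the attribute and the tag name refer to the same node, so there must be no space between them; otherwise the selector will not match the intended element.
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """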
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("a[class='sister']"))
Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
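Attribute selectors can likewise be combined with the other forms above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written with no space
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """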
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))
Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
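A space in a selector means "any descendant". The sketch below contrasts that with the child combinator (>), and also shows select_one() and a prefix match on an attribute value. The markup is a made-up variant of the sample in which one link is wrapped in a span, and "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Illustrative markup: link2 is nested one level deeper than link1.
html = """
<div><p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<span><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></span>
</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Space: every <a> anywhere below a <p>.
descendant_ids = [a["id"] for a in soup.select("p a")]

# ">": only <a> tags that are direct children of a <p>.
child_ids = [a["id"] for a in soup.select("p > a")]

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("a.sister")
nothing = soup.select_one("#no-such-id")

# [href^=...] matches attribute values that start with the given prefix.
prefix_ids = [a["id"] for a in soup.select('a[href^="http://example.com/l"]')]

print(descendant_ids, child_ids, first["id"], nothing, prefix_ids)
```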
(6) Getting the content
The select() calls above all return lists. You can loop over the result and use the get_text() method to get the text of each element
#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create a BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

print(soup.select("p a[class='sister']"))
for item in soup.select("p a[class='sister']"):
    print(item.get_text())
Output
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Lacie
Tillie
Note: <!-- Elsie --> is a comment, so get_text() skips it and the first link prints as an empty line
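Putting the pieces together, a common pattern is to collect the text and the href of every link while skipping links with no visible text. The following is a minimal sketch on trimmed sample markup, with "html.parser" assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup (illustrative only).
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.select("p a.sister"):
    text = a.get_text(strip=True)   # comments are skipped, so link1 gives ""
    if text:                        # keep only links with visible text
        links.append((text, a["href"]))

print(links)
```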