Using Beautiful Soup
A First Look at Beautiful Soup
# Beautiful Soup is a Python HTML/XML parsing library; it parses a page by way of the document's structure and attributes.
# Parsers supported by Beautiful Soup:

Parser                  | Usage                                | Advantages                                                         | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | built into Python, moderate speed, tolerant of bad markup          | versions before Python 2.7.3 / 3.2.2 tolerate bad markup poorly
lxml HTML parser        | BeautifulSoup(markup, "lxml")        | fast, tolerant of bad markup                                       | requires the C library
lxml XML parser         | BeautifulSoup(markup, "xml")         | fast, the only parser that supports XML                            | requires the C library
html5lib                | BeautifulSoup(markup, "html5lib")    | best fault tolerance, parses like a browser, produces HTML5 output | very slow, external Python dependency
A quick example:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.string)  # Output: Hello

Basic Usage of Beautiful Soup
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.prettify(), soup.title.string, sep='\n\n')
# 初始化BeautifulSoup时,自动更正了不标准的HTML
# prettify()方法可以把要解析的字符串以标准的缩进格式输出
# soup.title 可以选出HTML中的title节点,再调用string属性就可以得到里面的文本了 # 输出:
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>

The Dormouse's story

Node Selectors
# Selecting elements
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""" soup = BeautifulSoup(html, 'lxml') print(soup.title) # 打印输出title节点的选择结果
print(type(soup.title)) # 输出soup.title类型
print(soup.title.string) # 输出title节点的内容
print(soup.head) # 打印输出head节点的选择结果
print(soup.p) # 打印输出p节点的选择结果 # 输出:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p># 提取信息
# 调用string属性获取文本的值
# 利用那么属性获取节点的名称
# 调用attrs获取所有HTML节点属性from bs4 import BeautifulSoup html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""" soup = BeautifulSoup(html, 'lxml') print(soup.title.name) # 选取title节点,然后调用name属性获得节点名称
# 输出:title
print(soup.title.string) # 调用string属性,获取title节点的文本值
# 输出:The Dormouse's story print(soup.p.attrs) # 调用attrs,获取p节点的所有属性
# 输出:{'class': ['title'], 'name': 'dromouse'} print(soup.p.attrs['name']) # 获取name属性
# 输出:dromouse
print(soup.p['name']) # 获取name属性
# 输出:dromouse# 嵌套选择
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string) # 输出:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

# Associated selection
# 1. Child and descendant nodes
# The contents attribute returns a list of a node's direct children.
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
# 选取节点元素之后,可以调用contents属性获取它的直接子节点
print(soup.p.contents) # 输出:
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>, '\n ,\n ', <a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>, '\n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>, '\n ;\nand they lived at the bottom of a well.\n ']
# The result is a list whose elements are the selected node's direct children (grandchildren are not included)

Direct child nodes
# The children attribute yields the same nodes as contents but returns a generator instead of a list.
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.p.children) # 输出:<list_iterator object at 0x1159b7668>
for i, child in enumerate(soup.p.children):
print(i, child) # for 循环的输出结果:
0
Once upon a time there were three little sisters; and their names were 1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
, 3 <a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
4
and 5 <a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
6
;
and they lived at the bottom of a well.

Direct child nodes
# The descendants attribute queries children recursively, yielding all descendant nodes.
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants) # 输出:<generator object Tag.descendants at 0x1131d0048>
for i, child in enumerate(soup.p.descendants):
print(i, child) # for 循环输出结果:
0
Once upon a time there were three little sisters; and their names were 1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 3 <span>Elsie</span>
4 Elsie
5 6
, 7 <a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
8
Lacie 9
and 10 <a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
11
Tillie 12
;
and they lived at the bottom of a well.

Getting descendant nodes
# 2. Parent and ancestor nodes
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
...
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent) # 输出:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>

parent: getting a node's direct parent
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
...
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.a.parents, type(soup.a.parents), list(enumerate(soup.a.parents)), sep='\n\n') # 输出:
<generator object PageElement.parents at 0x11c76e048> <class 'generator'> [(0, <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
...
</p>
</body>), (2, <html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
...
</p>
</body>
</html>), (3, <html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">
...
</p>
</body>
</html>
)]

parents: getting all ancestor nodes
# A note on the built-in enumerate() function
# enumerate() wraps an iterable (such as a list, tuple, or string) into an indexed sequence, yielding each item together with its index; it is generally used in a for loop.
a = ["恕", "我", "直", "言", "在", "坐", "的", "各", "位", "都", "是", "爱", "学", "习", "的"]
print(a)  # Output: ['恕', '我', '直', '言', '在', '坐', '的', '各', '位', '都', '是', '爱', '学', '习', '的']
b = enumerate(a)
print(enumerate(a))  # Output: <enumerate object at 0x11a1f8b40>
print(list(b))
# Output: [(0, '恕'), (1, '我'), (2, '直'), (3, '言'), (4, '在'), (5, '坐'), (6, '的'), (7, '各'), (8, '位'), (9, '都'),
#          (10, '是'), (11, '爱'), (12, '学'), (13, '习'), (14, '的')]
for m, n in enumerate(a):
    print(m, n)
# Output of the for loop:
0 恕
1 我
2 直
3 言
4 在
5 坐
6 的
7 各
8 位
9 都
10 是
11 爱
12 学
13 习
14 的

The built-in enumerate() function
# 3. Sibling nodes
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
print(
# 获取下一个兄弟元素
{'Next Sibling': soup.a.next_sibling},
# 获取上一个兄弟元素
{'Previous Sibling': soup.a.previous_sibling},
# 返回后面的兄弟元素
{'Next Siblings': list(enumerate(soup.a.next_siblings))},
# 返回前面的兄弟元素
{'Previous Siblings': list(enumerate(soup.a.previous_siblings))}, sep='\n\n'
) # 输出:
{'Next Sibling': '\n ,\n '} {'Previous Sibling': '\n Once upon a time there were three little sisters; and their names were\n '} {'Next Siblings': [(0, '\n ,\n '), (1, <a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>), (2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>), (4, '\n ;\nand they lived at the bottom of a well.\n ')]} {'Previous Siblings': [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]}

Getting sibling nodes
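As the output shows, next_sibling and previous_sibling often return whitespace text nodes rather than tags. When you want the next sibling *tag* directly, Beautiful Soup also provides find_next_sibling() and find_previous_sibling(); a small sketch (the markup here is a trimmed-down version of the example above):

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" id="link1">Elsie</a>
    ,
    <a class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a
# next_sibling is the text node (comma and whitespace) between the two a tags
print(repr(first_a.next_sibling))
# find_next_sibling skips straight to the next matching tag
print(first_a.find_next_sibling("a"))
```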
# 4. Extracting information
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
</body>
</html>
""" soup = BeautifulSoup(html, 'lxml')
print(
'Next Sibling:', [soup.a.next_sibling], # 获取上一个兄弟节点
# \n
type(soup.a.next_sibling), # 上一个兄弟节点的类型
# <class 'bs4.element.NavigableString'>
[soup.a.next_sibling.string], # 获取上一个兄弟节点的内容
# \n
sep='\n'
) print(
'Parent:', [type(soup.a.parents)], # 获取所有的祖先节点
# <class 'generator'>
[list(soup.a.parents)[0]], # 获取第一个祖先节点
# <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
[list(soup.a.parents)[0].attrs['class']], # 获取第一个祖先节点的"class属性"的值
# ['story']
sep='\n'
) # 为了输出返回的结果,均以列表形式 # 输出:
Next Sibling:
['\n']
<class 'bs4.element.NavigableString'>
['\n']
Parent:
[<class 'generator'>]
[<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>]
[['story']]

Method Selectors
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
# find_all() queries all elements matching the given conditions
from bs4 import BeautifulSoup

html = """
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
""" soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='li'),
type(soup.find_all(name='li')[0]),
sep='\n\n') # 输出:
[<li class="item-O"><a href="linkl.html">first item</a></li>, <li class="item-1"><a href="link2.html">second item</a></li>, <li class="item-inactive"><a href="link3.html">third item</a></li>, <li class="item-1"><a href="link4.html">fourth item</a></li>, <li class="item-0"><a href="link5.html">fifth item</a>
</li>]

<class 'bs4.element.Tag'>

# The return value is a list whose elements are the nodes named "li"; each element is of type bs4.element.Tag
# Iterating over the a node inside each li
from bs4 import BeautifulSoup

html = """
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
""" soup = BeautifulSoup(html, 'lxml')
li = soup.find_all(name='li') for a in li:
print(a.find_all(name='a')) # 输出:
[<a href="linkl.html">first item</a>]
[<a href="link2.html">second item</a>]
[<a href="link3.html">third item</a>]
[<a href="link4.html">fourth item</a>]
[<a href="link5.html">fifth item</a>]name参数
from bs4 import BeautifulSoup html = """
<div>
<ul>
<li class="item-O"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
""" soup = BeautifulSoup(html, 'lxml') print(soup.find_all(attrs={'class': 'item-0'}))
print(soup.find_all(attrs={'href': 'link5.html'})) # 输出:
[<li class="item-0"><a href="link5.html">fifth item</a>
</li>]
[<a href="link5.html">fifth item</a>] # 可以通过attrs参数传入一些属性来进行查询,即通过特定的属性来查询
# find_all(attrs={'属性名': '属性值', ......})attrs参数
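Besides the attrs dict, find_all() also accepts common attributes as keyword arguments, e.g. id=... and class_=... (with a trailing underscore, since class is a Python keyword). A short sketch of the equivalent queries, using a cut-down version of the list above:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item-0"><a href="link5.html" id="a5">fifth item</a></li>
  <li class="item-1"><a href="link4.html" id="a4">fourth item</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all(class_="item-0"))  # same as attrs={'class': 'item-0'}
print(soup.find_all(id="a4"))          # same as attrs={'id': 'a4'}
```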
from bs4 import BeautifulSoup
import re

html = """
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
<div/>
<div/>
""" soup = BeautifulSoup(html, 'lxml') # 正则表达式规则对象
regular = re.compile('link') # text参数课用来匹配节点的文本,传入的形式可以是字符串,也可以是正则表达式对象
print(soup.find_all(text=regular)) # 正则匹配输出
print(re.findall(regular, html)) # 输出:
['Hello, this is a link', 'Hello, this is a link, too']
['link', 'link']

The text parameter

find(name=None, attrs={}, recursive=True, text=None, **kwargs) returns only the first element that matches the given conditions
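A short sketch of find(): it takes the same parameters as find_all() but returns a single Tag instead of a list, or None when nothing matches (the markup here is a hypothetical minimal list):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">first</li>
  <li class="item">second</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find(name="li"))                   # only the first matching li
print(soup.find(attrs={"class": "missing"}))  # None, since nothing matches
```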
CSS Selectors
- Beautiful Soup also provides CSS selectors via the select() method
- CSS selector reference: http://www.w3school.com.cn/cssref/css_selectors.asp
select(selector, namespaces=None, limit=None, **kwargs)
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(
    soup.select('.panel .panel-heading'),
    soup.select('ul li'),
    soup.select('#list-2 .element'),
    type(soup.select('ul')[0]),
    sep='\n\n'
)
# Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

[<li class="element">Foo</li>, <li class="element">Bar</li>]

<class 'bs4.element.Tag'>

A simple example
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
ul_all = soup.select('ul')
print(ul_all)
for ul in ul_all:
    print()
    print(ul['id'], ul.select('li'), sep='\n')
# Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

list-1
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]

list-2
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Nested selection

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
ul_all = soup.select('ul')
print(ul_all)
for ul in ul_all:
    print()
    print(ul['id'], ul.attrs['id'], sep='\n')
# Indexing with brackets and the attribute name, or going through the attrs attribute, both return the attribute value
# Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

list-1
list-1

list-2
list-2

Getting attributes

html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
li_all = soup.select('li')
print(li_all)
for li in li_all:
    print()
    print(
        'get_text() text: ' + li.get_text(),
        'string attribute text: ' + li.string,
        sep='\n'
    )
# Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

get_text() text: Foo
string attribute text: Foo

get_text() text: Bar
string attribute text: Bar

get_text() text: Jay
string attribute text: Jay

get_text() text: Foo
string attribute text: Foo

get_text() text: Bar
string attribute text: Bar

Getting text
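The li nodes above each hold a single text child, so get_text() and string agree. The two differ once a node has several children: get_text() concatenates all descendant text, while string returns None because the choice is ambiguous. A small sketch with a hypothetical li:

```python
from bs4 import BeautifulSoup

html = '<li class="element"><b>Foo</b> and <b>Bar</b></li>'
soup = BeautifulSoup(html, "html.parser")
li = soup.li

print(li.get_text())  # concatenates all descendant text: Foo and Bar
print(li.string)      # None, since li has several children
```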