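Python Crawler Notes 5: Basic Usage of the BeautifulSoup Library
Topic 1: BeautifulSoup in detail and its basic usage
What is BeautifulSoup
A flexible, convenient, and efficient web page parsing library that supports several parsers. With it you can extract information from web pages without writing regular expressions.
Common parsers in BeautifulSoup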
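The parser is chosen by the second argument to the BeautifulSoup constructor. The sketch below is only an illustration with a made-up snippet: it runs with "html.parser", which ships with Python, and lists the other commonly used parser names in comments because each of them needs an extra package.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately unclosed markup.
markup = "<p>Hello<b>world"

# "html.parser" is part of the standard library, so no extra install is needed.
soup = BeautifulSoup(markup, "html.parser")
print(soup.b.string)

# The other parser names are passed the same way, but need third-party packages:
#   BeautifulSoup(markup, "lxml")      # HTML parser from the lxml package
#   BeautifulSoup(markup, "xml")       # XML parser, also from the lxml package
#   BeautifulSoup(markup, "html5lib")  # parser from the html5lib package
```

The rest of these notes use 'lxml'; the parsers can build slightly different trees from broken markup, so results may differ when you switch.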
Basic usage:
- html = '''
- <html><head><title>The Domouse's story</title></head>
- <body>
- <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
- <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
- <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
- and they lived at bottom of a well.</p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup= BeautifulSoup(html,'lxml')
- print(soup.prettify())  # pretty-print the markup; the parser fills in the missing closing tags
- print(soup.title.string)  # the document title
- <html>
- <head>
- <title>
- The Domouse's story
- </title>
- </head>
- <body>
- <p class="title" name="dromouse">
- <b>
- The Dormouse's story
- </b>
- </p>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a class="sister" href="http://example.com/elsie" id="link1">
- <!--Elsie-->
- </a>
- <a class="sister" hred="http://example.com/lacle" id="link2">
- Lacle
- </a>
- and
- <a class="sister" hred="http://example.com/tilie" id="link3">
- Tillie
- </a>
- and they lived at bottom of a well.
- </p>
- <p class="story">
- ...
- </p>
- </body>
- </html>
- The Domouse's story
Output
Tag selectors
Selecting elements
- html = '''
- <html><head><title>The Domouse's story</title></head>
- <body>
- <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
- <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
- <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
- and they lived at bottom of a well.</p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.title)
- #<title>The Domouse's story</title>
- print(type(soup.title))
- #<class 'bs4.element.Tag'>
- print(soup.head)
- #<head><title>The Domouse's story</title></head>
- print(soup.p)  # when several tags match, only the first one is returned
- #<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Getting the tag name
- html = '''
- <html><head><title>The Domouse's story</title></head>
- <body>
- <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
- <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
- <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
- and they lived at bottom of a well.</p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.title.name)
- #title
Getting attributes
- html = '''
- <html><head><title>The Domouse's story</title></head>
- <body>
- <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
- <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
- <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
- and they lived at bottom of a well.</p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.p.attrs['name'])
- #dromouse
- print(soup.p['name'])
- #dromouse
Getting the tag content
- html = '''
- <html><head><title>The Domouse's story</title></head>
- <body>
- <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
- <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
- <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
- and they lived at bottom of a well.</p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.p.string)
- #The Dormouse's story
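One caveat the example above does not show: .string only returns text when the tag has a single child (or a single chain of only children); otherwise it returns None, and get_text() is the usual alternative. A minimal sketch with a made-up snippet, using the built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> wraps a single <b>, the second mixes text and a tag.
html = '<p class="title"><b>The story</b></p><p class="story">Once <a href="#">upon</a> a time</p>'
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")

# One child chain (p -> b -> text), so .string drills down to the text.
print(first.string)

# Several children (text, <a>, text), so .string is None ...
print(second.string)

# ... while get_text() concatenates every piece of text inside the tag.
print(second.get_text())
```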
Nested selection
- html = '''
- <html><head><title>The Domouse's story</title></head>
- <body>
- <p class="title"name="dromouse"><b>The Dormouse's story</b></p>
- <p class="story">Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie"class="sister"id="link1"><!--Elsie--></a>
- <a hred="http://example.com/lacle"class="sister"id="link2">Lacle</a>and
- <a hred="http://example.com/tilie"class="sister"id="link3">Tillie</a>
- and they lived at bottom of a well.</p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(type(soup.title))
- #<class 'bs4.element.Tag'>
- print(soup.head.title.string)  # the HTML is nested (head contains title; body contains p and a), so attribute access can be chained to reach the inner tag
- #The Domouse's story
Child and descendant nodes
- # get a tag's child nodes
- html2 = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup2 = BeautifulSoup(html2,'lxml')
- print(soup2.p.contents)
- ['\n Once upon a time there were little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>, '\n', <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>, '\n and\n ', <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>, '\n and they lived at bottom of a well.\n ']
Output
Another approach:
- # get a tag's child nodes
- html2 = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html2,'lxml')
- print(soup.p.children)  # unlike contents, children is an iterator, so loop over it to get the nodes
- for i, child in enumerate(soup.p.children):
-     print(i, child)
- <list_iterator object at 0x00000208F026B400>
- 0
- Once upon a time there were little sisters;and their names were
- 1 <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- 2
- 3 <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- 4
- and
- 5 <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- 6
- and they lived at bottom of a well.
Output
The difference: children is an iterator that has to be consumed in a loop, whereas contents is a plain list.
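The list-versus-iterator distinction can be checked directly. The sketch below uses a shortened, made-up version of the paragraph and also filters out the text nodes, which is often what you want when only the child tags matter:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened markup for illustration.
html = '<p class="story">Once <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# contents is a list, so it supports len() and indexing.
contents = soup.p.contents
print(len(contents), type(contents).__name__)

# children yields the same nodes lazily; list() materialises them.
children = list(soup.p.children)
print(len(children))

# Keep only the Tag children, skipping the text nodes in between.
tag_ids = [child["id"] for child in soup.p.children if isinstance(child, Tag)]
print(tag_ids)
```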
- # get a tag's descendant nodes
- html2 = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html2,'lxml')
- print(soup.p.descendants)  # all descendant nodes; this is also an iterator (a generator)
- for i, child in enumerate(soup.p.descendants):
-     print(i, child)
Descendant nodes
- <generator object descendants at 0x00000208F0240AF0>
- 0
- Once upon a time there were little sisters;and their names were
- 1 <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- 2
- 3 <span>Elsle</span>
- 4 Elsle
- 5
- 6
- 7 <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- 8 Lacle
- 9
- and
- 10 <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- 11 Tillie
- 12
- and they lived at bottom of a well.
---> Output
Parent and ancestor nodes
- # parent node
- html = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.a.parent)
Parent node
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- and
- <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
---> Output
- # get the ancestor nodes
- html = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(list(enumerate(soup.a.parents)))  # all ancestor nodes (the direct parent included)
Ancestor nodes
- [(0, <p class="story">
- Once upon a time there were little sisters;and their names were
- <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- and
- <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>), (1, <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- and
- <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- </body>), (2, <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- and
- <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- </body></html>), (3, <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a class="sister" href="http://example.com/elsie" id="link1">
- <span>Elsle</span>
- </a>
- <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>
- and
- <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- </body></html>)]
---> Output
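The full ancestor dump is hard to read, and the whole document appears twice at the end of it: once for the html tag and once for the BeautifulSoup object that wraps the document. Printing only the names makes this visible. A small sketch with made-up markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document.
html = "<html><body><p><a href='#'>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upwards from the first <a> and record only the names.
names = [parent.name for parent in soup.a.parents]
print(names)

# The last ancestor is the BeautifulSoup object itself.
print(names[-1] == soup.name)
```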
Sibling nodes
- # get the preceding siblings
- html = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- # sibling nodes (nodes at the same level)
- print(list(enumerate(soup.a.previous_siblings)))  # siblings before the tag
Preceding siblings
- [(0, '\n Once upon a time there were little sisters;and their names were\n ')]
---> Output
- html = '''
- <html>
- <head>
- <title>The Domouse's story</title>
- </head>
- <body>
- <p class="story">
- Once upon a time there were little sisters;and their names were
- <a href="http://example.com/elsie" class="sister"id="link1">
- <span>Elsle</span>
- </a>
- <a hred="http://example.com/lacle"class="sister" id="link2">Lacle</a>
- and
- <a hred="http://example.com/tilie"class="sister" id="link3">Tillie</a>
- and they lived at bottom of a well.
- </p>
- <p class="story">...</p>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- # sibling nodes (nodes at the same level)
- print(list(enumerate(soup.a.next_siblings)))  # siblings after the tag
Following siblings
- [(0, '\n'), (1, <a class="sister" hred="http://example.com/lacle" id="link2">Lacle</a>), (2, '\n and\n '), (3, <a class="sister" hred="http://example.com/tilie" id="link3">Tillie</a>), (4, '\n and they lived at bottom of a well.\n ')]
---> Output
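As the output shows, the sibling lists include whitespace text nodes as well as tags. When only the sibling tags are wanted, find_next_sibling() and find_next_siblings() can skip the text. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between the links.
html = """<p class="story">
<a id="link1">Elsie</a>
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a

# next_sibling is the newline text node that follows the first <a>.
print(repr(first.next_sibling))

# find_next_sibling("a") skips text nodes and returns the next <a> tag.
print(first.find_next_sibling("a")["id"])

# find_next_siblings("a") returns every later <a> sibling as a list.
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(later_ids)
```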
Standard selectors
find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.
Searching by name
- html = '''
- <div class="panel">
- <div class="panel-heading"name="elements">
- <h4>Hello</h4>
- </div>
- <div class="panel-body">
- <ul class="list"Id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"Id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.find_all('ul'))  # the result is a list
- print(type(soup.find_all('ul')[0]))
- [<ul class="list" id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>, <ul class="list list-small" id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>]
- <class 'bs4.element.Tag'>
Output
- html = '''
- <div class="panel">
- <div class="panel-heading"name="elements">
- <h4>Hello</h4>
- </div>
- <div class="panel-body">
- <ul class="list"Id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"Id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- for ul in soup.find_all('ul'):
-     print(ul.find_all('li'))  # nested search: look for li inside each ul
- [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
- [<li class="element">Foo</li>, <li class="element">Bar</li>]
Output
Searching by attrs
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body">
- <ul class="list"id="list-1" name="elements">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.find_all(attrs={'id':'list-1'}))
- print(soup.find_all(attrs={'name':'elements'}))
- [<ul class="list" id="list-1" name="elements">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>]
- [<ul class="list" id="list-1" name="elements">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>]
Output
Another way
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body">
- <ul class="list"id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.find_all(id='list-1'))
- print(soup.find_all(class_='element'))
Another way
- [<ul class="list" id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>]
- [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
---> Output
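The keyword is spelled class_ because class is a reserved word in Python. Since class is a multi-valued attribute, class_ matches against each individual class name. The sketch below illustrates this on a made-up, shortened version of the two lists:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup for illustration.
html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_='list' matches any tag whose classes include "list".
with_list = [ul["id"] for ul in soup.find_all(class_="list")]
print(with_list)

# Only the second <ul> also carries "list-small".
with_small = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]
print(with_small)

# The attribute value itself comes back as a list of class names.
print(soup.find(id="list-2")["class"])
```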
Searching by text
- # text
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body"name="elelments">
- <ul class="list"Id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"Id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.find_all(text='Foo'))
- #['Foo', 'Foo']
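In Beautiful Soup 4.4.0 and later the same argument is also available under the name string. It accepts a regular expression as well as an exact value, and it can be combined with a tag name. A minimal sketch, assuming a reasonably recent bs4 and a made-up list:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Exact match on the text; the results are strings, not tags.
exact = soup.find_all(string="Foo")
print(exact)

# A compiled regular expression matches every text node it finds a match in.
by_regex = soup.find_all(string=re.compile("^Ba"))
print(by_regex)

# Combining a tag name with string= returns the tags whose .string matches.
tags = soup.find_all("li", string="Bar")
print(tags)
```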
find(name,attrs,recursive,text,**kwargs) returns a single element, whereas find_all returns all matching elements
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body"name="elelments">
- <ul class="list"Id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"Id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.find('ul'))
- print(type(soup.find('ul')))
- print(soup.find('page'))
- <ul class="list" id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <class 'bs4.element.Tag'>
- None
Output
Other methods
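- find() returns a single element
- find_parents() returns all ancestor nodes
- find_parent() returns the direct parent node
- find_next_siblings() returns all following siblings
- find_next_sibling() returns the first following sibling
- find_previous_siblings() returns all preceding siblings
- find_previous_sibling() returns the first preceding sibling
- find_all_next() returns all matching nodes after the current node
- find_next() returns the first matching node after the current node
- find_all_previous() returns all matching nodes before the current node
- find_previous() returns the first matching node before the current node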
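The methods listed above take the same filters as find() and find_all(). The sketch below tries several of them from one starting tag; the markup is a made-up, compressed version of the panel example, and the starting point (the "Bar" item) is chosen only for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical, compressed markup for illustration.
html = ('<div class="panel"><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
        '<ul id="list-2"><li>Foo</li></ul></div>')
soup = BeautifulSoup(html, "html.parser")

bar = soup.find("li", string="Bar")

# Nearest matching ancestor, then all matching ancestors (innermost first).
print(bar.find_parent("ul")["id"])
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]
print(ancestors)

# First <li> after and before Bar at the same level.
print(bar.find_next_sibling("li").string)
print(bar.find_previous_sibling("li").string)

# Every later / earlier <li> in document order, regardless of nesting.
later = [li.string for li in bar.find_all_next("li")]
earlier = [li.string for li in bar.find_all_previous("li")]
print(later, earlier)
```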
CSS selectors (pass a CSS selector straight to select() to make a selection)
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body"name="elelments">
- <ul class="list"id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- print(soup.select('.panel .panel-heading'))  # a class name is prefixed with "."
- print(soup.select('ul li'))  # select by tag name
- print(soup.select('#list-2 .element'))
- print(type(soup.select('ul')[0]))
- [<div class="panel-heading">
- <h4>Hello</h4>
- </div>]
- [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
- [<li class="element">Foo</li>, <li class="element">Bar</li>]
- <class 'bs4.element.Tag'>
Output
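Beyond the descendant selectors used above, select() also understands other common CSS forms, and select_one() returns just the first match. A small sketch with made-up markup; the particular selectors are only examples:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup for illustration.
html = """
<div class="panel">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None) instead of a list.
first_item = soup.select_one("#list-1 .element").get_text()
print(first_item)

# A class on the tag plus the child combinator ">".
small_items = [li.get_text() for li in soup.select("ul.list-small > li")]
print(small_items)

# An attribute selector: every <ul> whose id starts with "list-".
ul_ids = [ul["id"] for ul in soup.select('ul[id^="list-"]')]
print(ul_ids)
```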
Another approach:
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body"name="elelments">
- <ul class="list"Id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"Id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- for ul in soup.select('ul'):  # or simply print(soup.select('ul li')) to get every li in one list
-     print(ul.select('li'))
Another approach
- [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
- [<li class="element">Foo</li>, <li class="element">Bar</li>]
---> Output
Getting attributes
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body"name="elelments">
- <ul class="list"id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- for ul in soup.select('ul'):
-     print(ul['id'])  # index the tag directly with []
-     print(ul.attrs['id'])  # or go through attrs and []
- list-1
- list-1
- list-2
- list-2
Output
Getting content
- html = '''
- <div class="panel">
- <div class="panel-heading">
- <h4>Hello</h4>
- </div>
- <div class="panel-body"name="elelments">
- <ul class="list"Id="list-1">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- <li class="element">Jay</li>
- </ul>
- <ul class="list list-small"Id="list-2">
- <li class="element">Foo</li>
- <li class="element">Bar</li>
- </ul>
- </div>
- <div>
- '''
- from bs4 import BeautifulSoup
- soup = BeautifulSoup(html,'lxml')
- for li in soup.select('li'):
-     print(li['class'], li.get_text())
- ['element'] Foo
- ['element'] Bar
- ['element'] Jay
- ['element'] Foo
- ['element'] Bar
Output
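When get_text() is called on a container rather than a leaf tag, the whitespace between the child tags is included. The separator and strip arguments, or the stripped_strings generator, help clean that up. A minimal sketch on a made-up list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation and padding inside the tags.
html = """<ul id="list-1">
  <li class="element"> Foo </li>
  <li class="element">Bar</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

# Plain get_text() keeps the whitespace between the tags.
print(repr(ul.get_text()))

# strip=True trims each text fragment and drops the empty ones;
# the separator is placed between the fragments that remain.
joined = ul.get_text(separator=", ", strip=True)
print(joined)

# stripped_strings yields the same trimmed fragments one at a time.
fragments = list(ul.stripped_strings)
print(fragments)
```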
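Summary
The 'lxml' parser is recommended; fall back to html.parser when necessary.
Tag selectors offer only weak filtering, but they are fast.
Use find() and find_all() to match a single result or multiple results.
If you are familiar with CSS selectors, select() is a good choice.
Remember the common ways of getting attribute values and text.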