Parsing HTML and XML with BeautifulSoup


Getting to Know Beautiful Soup

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Beautiful Soup is a Python library for extracting data from HTML or XML text. It parses an HTML or XML document into a tree structure from which the relevant information can be pulled out.

Beautiful Soup is a flexible, convenient, and efficient parsing library that supports several underlying parsers (introduced below). With it, you can extract information from a web page without writing regular expressions.

Installation

Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with:

    pip install beautifulsoup4

The four parsers supported by Beautiful Soup

  • Python standard library — BeautifulSoup(markup, "html.parser"). Advantages: built into Python, moderate speed, good error tolerance. Disadvantages: poor tolerance of malformed documents in Python versions before 2.7.3 / 3.2.2.
  • lxml HTML parser — BeautifulSoup(markup, "lxml"). Advantages: very fast, good error tolerance. Disadvantages: requires the C-based lxml library.
  • lxml XML parser — BeautifulSoup(markup, "xml"). Advantages: very fast, the only supported XML parser. Disadvantages: requires the C-based lxml library.
  • html5lib — BeautifulSoup(markup, "html5lib"). Advantages: best error tolerance, parses documents the same way a browser does, produces valid HTML5. Disadvantages: very slow, and depends on an external Python library.

If you just want to parse an HTML document, creating a BeautifulSoup object from the document is enough: Beautiful Soup will pick a parser automatically. You can also specify the parser explicitly. The first argument to BeautifulSoup() should be the document to parse, as a string or a file handle; the second argument names the parser. If the second argument is omitted, Beautiful Soup chooses among the libraries installed on the system in this order of priority: lxml, html5lib, then Python's built-in html.parser.

Installing the parser libraries:

    pip install html5lib
    pip install lxml
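To see how the parsers differ in practice, here is a minimal sketch (assuming html.parser, lxml, and html5lib are all installed) that feeds the same broken fragment to each parser:

```python
from bs4 import BeautifulSoup

# A deliberately broken fragment: two unclosed <p> tags.
broken = "<p>one<p>two"

# html.parser keeps only what is present; lxml and html5lib also add the
# implied <html>/<body> (and, for html5lib, <head>) elements, the way a
# browser would.
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken, parser)
    print(parser, "->", str(soup))
```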

Basic elements of the BeautifulSoup class

Basic usage

Error tolerance: a parser's error tolerance is its ability to recognize and recover from errors when the HTML is incomplete or malformed.

Parsing the code below with BeautifulSoup produces a BeautifulSoup object, which can be printed in a standard, properly indented structure:

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())      # pretty-print with proper indentation
    print(soup.title.string)
Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!-- Elsie -->
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
       and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    The Dormouse's story

Tag selectors

Selecting a tag element (when several match, the first one is returned)

Getting the tag name, the tag itself, the tag's text, and the tag's attributes

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The is pppp</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')

    print(soup.title)       # the tag itself: <title>The Dormouse's story</title>
    print(soup.title.name)  # the tag name

    print(soup.title.text)  # the tag's text content
    print(soup.p.text)
    print(soup.p.string)

    dic = soup.p.attrs  # all attributes of the first <p>, as a dict
    print(dic)
    print(dic["name"])
    print(soup.p.attrs["class"])  # a specific attribute; class values come back as a list
    print(soup.p["class"])

Output:

    <title>The Dormouse's story</title>
    title
    The Dormouse's story
    The is pppp
    The is pppp
    {'class': ['title'], 'name': 'dromouse'}
    dromouse
    ['title']
    ['title']
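The output above shows .text and .string agreeing, but they are not interchangeable: .text concatenates every string in the subtree, while .string only returns a value when a tag has exactly one child string (otherwise it is None). A small sketch of the difference:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.p.text)    # every string under <p> concatenated: Hello world
print(soup.p.string)  # <p> has two children, so this is None
print(soup.b.string)  # <b> has exactly one string child: world
```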

Nested tag selection

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <div class="title" name="dromouse"><b class='bb bcls xiong'>The Dormouse's story</b></div>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')

    print(soup.div.b['class'])  # nested tag selection

    print(soup.p.stripped_strings)  # a generator object
    print(list(soup.p.stripped_strings))
    print(soup.p.text)

Output:

    ['bb', 'bcls', 'xiong']
    <generator object stripped_strings at 0x000002471D323830>
    ['Once upon a time there were three little sisters; and their names were', ',', 'Lacie', 'and', 'Tillie', ';\nand they lived at the bottom of a well.']
    Once upon a time there were three little sisters; and their names were
    ,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.

Node operations

Children and descendants

A tag's direct children include not only tag nodes but also string nodes; newlines appear as '\n'.

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')

    print(soup.p.contents)  # all direct children of <p>, as a list

    print("======================================================================>")
    print(soup.p.children)  # direct children, as an iterator
    for i, child in enumerate(soup.p.children):
        print(i, str(child).strip())  # child is a bs4.element object

    print("======================================================================>")
    print(soup.p.descendants)  # all descendants, as a generator
    for i, child in enumerate(soup.p.descendants):
        print(i, child)

Output:

    ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
    ======================================================================>
    <list_iterator object at 0x000001C2E2AB6EF0>
    0 Once upon a time there were three little sisters; and their names were
    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2
    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4 and
    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6 and they lived at the bottom of a well.
    ======================================================================>
    <generator object descendants at 0x000001C2E2AA3830>
    0
    Once upon a time there were three little sisters; and their names were

    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2

    3 <span>Elsie</span>
    4 Elsie
    5

    6

    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9
    and

    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12
    and they lived at the bottom of a well.

Parents and ancestors

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')

    print(soup.a.parent)  # the direct parent

    print("========================================================================>")
    print(soup.a.parents)  # all ancestors, as a generator
    for item in soup.a.parents:
        print(item)

Output:

    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    ========================================================================>
    <generator object parents at 0x000001A078752830>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body>
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>

Siblings

    from bs4 import BeautifulSoup

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """
    soup = BeautifulSoup(html, 'lxml')

    print(list(enumerate(soup.a.next_sibling)))      # the next sibling node (a single string here, so enumerate yields its characters)
    print(list(enumerate(soup.a.next_siblings)))     # all following siblings
    print(list(enumerate(soup.a.previous_sibling)))  # the previous sibling node (again a string, iterated character by character)
    print(list(enumerate(soup.a.previous_siblings))) # all preceding siblings

Output:

    [(0, '\n')]
    [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
    [(0, '\n'), (1, ' '), (2, ' '), (3, ' '), (4, ' '), (5, ' '), (6, ' '), (7, ' '), (8, ' '), (9, ' '), (10, ' '), (11, ' '), (12, ' '), (13, 'O'), (14, 'n'), (15, 'c'), (16, 'e'), (17, ' '), (18, 'u'), (19, 'p'), (20, 'o'), (21, 'n'), (22, ' '), (23, 'a'), (24, ' '), (25, 't'), (26, 'i'), (27, 'm'), (28, 'e'), (29, ' '), (30, 't'), (31, 'h'), (32, 'e'), (33, 'r'), (34, 'e'), (35, ' '), (36, 'w'), (37, 'e'), (38, 'r'), (39, 'e'), (40, ' '), (41, 't'), (42, 'h'), (43, 'r'), (44, 'e'), (45, 'e'), (46, ' '), (47, 'l'), (48, 'i'), (49, 't'), (50, 't'), (51, 'l'), (52, 'e'), (53, ' '), (54, 's'), (55, 'i'), (56, 's'), (57, 't'), (58, 'e'), (59, 'r'), (60, 's'), (61, ';'), (62, ' '), (63, 'a'), (64, 'n'), (65, 'd'), (66, ' '), (67, 't'), (68, 'h'), (69, 'e'), (70, 'i'), (71, 'r'), (72, ' '), (73, 'n'), (74, 'a'), (75, 'm'), (76, 'e'), (77, 's'), (78, ' '), (79, 'w'), (80, 'e'), (81, 'r'), (82, 'e'), (83, '\n'), (84, ' '), (85, ' '), (86, ' '), (87, ' '), (88, ' '), (89, ' '), (90, ' '), (91, ' '), (92, ' '), (93, ' '), (94, ' '), (95, ' ')]
    [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]

Standard selectors: find/find_all (* * * * *)

Methods for searching HTML content, provided by the bs4 library

    <>.find_all(name, attrs, recursive, text, **kwargs)  # returns a list of the matching results

  • name — string matched against tag names
  • attrs — matched against tag attribute values; specific attributes can be given
  • recursive — whether to search all descendants (default: True)
  • text — matched against text content
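A small sketch of the recursive and text parameters, on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = "<div><ul><li>Foo</li><li>Bar</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.div.find_all("li"))                   # searches all descendants of <div>
print(soup.div.find_all("li", recursive=False))  # direct children only, so: []
print(soup.find_all(text="Foo"))                 # matches string nodes, not tags
```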

Other find methods:

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or content

The name parameter

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))
    print(type(soup.find_all('ul')[0]))
Output:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    <class 'bs4.element.Tag'>
    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

The attrs parameter

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list2 list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')

    print(soup.find_all(attrs={'id': 'list-1'}))  # recommended form
    print(soup.find_all(id="list-1"))             # **kwargs-style form; same result as above

    print(soup.find_all(attrs={'class': 'list-small'}))
    print(soup.find_all(class_="list2"))          # class is a reserved word, so the keyword form is class_

Output:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list2 list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    [<ul class="list2 list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]

The text parameter

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))
Output:

    ['Foo', 'Foo']

find(name, attrs, recursive, text, **kwargs)

find() returns a single element; find_all() returns all matching elements.

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('ul'))
    print(type(soup.find('ul')))
    print(soup.find('page'))
Output:

    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <class 'bs4.element.Tag'>
    None

find_parents() and find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

find_next_siblings() and find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() and find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next() and find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first match.

find_all_previous() and find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first match.
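All of these take the same arguments as find()/find_all(). A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a id="link1">Elsie</a> <a id="link2">Lacie</a> <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")
print(first.find_parent("p")["class"])                         # ['story']
print(first.find_next_sibling("a")["id"])                      # link2
print(soup.find(id="link3").find_previous_sibling("a")["id"])  # link2
print([a["id"] for a in first.find_all_next("a")])             # ['link2', 'link3']
```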

CSS selectors (* * * * *)

Pass a CSS selector directly to select() to make a selection.

    from bs4 import BeautifulSoup

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-heading">
    <h4>World</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    soup = BeautifulSoup(html, 'lxml')

    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))
    print(type(soup.select('ul')[0]))

Output:

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>, <div class="panel-heading">
    <h4>World</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    <class 'bs4.element.Tag'>
    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul.select('li'))
Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

Getting attributes

ul.attrs['id']

ul['id']

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul['id'])
        print(ul.attrs['id'])
Output:

    list-1
    list-1
    list-2
    list-2

Getting text content

li.get_text()

    html = '''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())
Output:

    Foo
    Bar
    Jay
    Foo
    Bar
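get_text() also accepts a separator string and strip=True, which matter when the surrounding whitespace is significant; a small sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> Hello <b>world</b> </p>", "html.parser")

print(repr(soup.p.get_text()))                 # all strings concatenated, whitespace kept
print(repr(soup.p.get_text(strip=True)))       # each string stripped, then joined with no separator
print(repr(soup.p.get_text(" ", strip=True)))  # each string stripped, joined with a space
```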

Summary:

  • The lxml parser is recommended; use html.parser when necessary
  • Tag selectors are fast but offer only weak filtering
  • Use find() / find_all() to match a single result or multiple results
  • If you are comfortable with CSS selectors, use select()

Example: a crawler for the Chinese university rankings

Step 1: fetch the ranking page from the web — getHTMLText()
Step 2: extract the information from the page into a suitable data structure — fillUnivList()
Step 3: display and output the results from that data structure — printUnivList()

    import requests
    from bs4 import BeautifulSoup
    import bs4

    def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return "error"

    def fillUnivList(ulist, html):
        soup = BeautifulSoup(html, "html.parser")
        for tr in soup.find('tbody').children:
            if isinstance(tr, bs4.element.Tag):  # filter out non-tag nodes
                tds = tr('td')
                ulist.append([tds[0].string, tds[1].string, tds[3].string])

    # Aligning Chinese output: pad with the full-width space chr(12288)
    def printUnivList(ulist, num):
        tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        print(tplt.format("排名", "学校名称", "总分", chr(12288)))
        for i in range(num):
            u = ulist[i]
            print(tplt.format(u[0], u[1], u[2], chr(12288)))

    def main():
        uinfo = []
        url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
        html = getHTMLText(url)
        fillUnivList(uinfo, html)
        printUnivList(uinfo, 20)

    if __name__ == '__main__':
        main()


Visualizing the collected data with pyecharts

    import requests, json, re, bs4
    from bs4 import BeautifulSoup

    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36'}

    def getHtmlText(url):
        try:
            ret = requests.get(url, headers=header, timeout=30)
            ret.encoding = "utf8"
            ret.raise_for_status()
            return ret.text
        except:
            return None

    def fillUnivList(ulist, html):
        soup = BeautifulSoup(html, "lxml")
        for tr in soup.tbody.children:
            if isinstance(tr, bs4.element.Tag):  # keep only bs4.element.Tag nodes
                tds = tr("td")
                # print(tds)
                ulist.append([tds[0].string, tds[1].string, tds[2].string, tds[3].string])

    # Aligning Chinese output: pad with the full-width space chr(12288)
    def printUnivList(ulist, num):
        tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        print(tplt.format("排名", "学校名称", "总分", chr(12288)))
        for i in range(num):
            u = ulist[i]
            print(tplt.format(u[0], u[1], u[3], chr(12288)))

    # Visualization with pyecharts
    def showData(ulist, num):
        from pyecharts import Bar
        attrs = []
        vals = []
        for i in range(num):
            attrs.append(ulist[i][1])
            vals.append(ulist[i][3])
        bar = Bar("2019中国大学排行榜")
        bar.add(
            "中国大学排行榜",
            attrs,
            vals,
            is_datazoom_show=True,
            datazoom_type="both",
            datazoom_range=[0, 10],
            xaxis_rotate=30,
            xaxis_label_textsize=8,
            is_label_show=True,
        )
        bar.render("2019中国大学排行榜4.html")

    def showData_funnel(ulist, num):
        from pyecharts import Funnel
        attrs = []
        vals = []
        for i in range(num):
            attrs.append(ulist[i][1])
            vals.append(ulist[i][3])
        funnel = Funnel(width=1000, height=800)
        funnel.add(
            "大学排行榜",
            attrs,
            vals,
            is_label_show=True,
            label_pos="inside",
            label_text_color="#fff",
        )
        funnel.render("2019中国大学排行榜4.html")

    def main():
        uinfo = []
        url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
        html = getHtmlText(url)
        fillUnivList(uinfo, html)
        print(uinfo)
        # showData(uinfo, 100)
        showData_funnel(uinfo, 20)
        # printUnivList(uinfo, 30)

    if __name__ == '__main__':
        main()


Appendix 1:

Python's built-in isinstance function

Syntax: isinstance(object, type)

Purpose: checks whether an object is of a given type.

Its first argument (object) is the object to check; the second (type) is a type (such as int) or a tuple of types (for example (int, list, float)). The return value is a boolean (True or False).

If the object's type matches the second argument, isinstance returns True. If the second argument is a tuple, it returns True when the object's type matches any type in the tuple.

Two examples:

Example 1

    >>> a = 4
    >>> isinstance(a, int)
    True
    >>> isinstance(a, str)
    False
    >>> isinstance(a, (str, int, list))
    True

Example 2

    >>> a = "b"
    >>> isinstance(a, str)
    True
    >>> isinstance(a, int)
    False
    >>> isinstance(a, (int, list, float))
    False
    >>> isinstance(a, (int, list, float, str))
    True

Appendix 2:

Response.raise_for_status()

If the request failed (a 4XX client error or a 5XX server error response), Response.raise_for_status() raises an exception:

    >>> bad_r = requests.get('http://httpbin.org/status/404')
    >>> bad_r.status_code
    404

    >>> bad_r.raise_for_status()
    Traceback (most recent call last):
      File "requests/models.py", line 832, in raise_for_status
        raise http_error
    requests.exceptions.HTTPError: 404 Client Error

But since the status_code of r in our example is 200, calling raise_for_status() simply gives:

    >>> r.raise_for_status()
    None

References:

http://www.cnblogs.com/0bug/p/8260834.html

http://pyecharts.org/#/

https://www.cnblogs.com/kongzhagen/p/6472746.html

https://www.cnblogs.com/haiyan123/p/8289560.html

https://www.cnblogs.com/haiyan123/p/8317398.html
