Topic 1: The BeautifulSoup library in detail and its basic usage

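  • What is BeautifulSoup

A flexible and convenient web page parsing library that is efficient and supports several parsers. With it you can extract information from a page without writing regular expressions.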

  • Common parsers in BeautifulSoup
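
The parser is chosen by the second argument of the BeautifulSoup constructor. 'html.parser' ships with Python, while 'lxml', 'xml' (lxml's XML parser) and 'html5lib' need third-party packages. The following is a minimal sketch rather than a benchmark: the markup string is made up for illustration, and it uses 'html.parser' only so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two tags left open on purpose.
markup = "<p>unclosed paragraph<b>bold"

# 'html.parser' is bundled with Python; 'lxml' and 'html5lib' are optional
# third-party parsers that are selected the same way, by name.
soup = BeautifulSoup(markup, "html.parser")

# The parser closes the dangling <b> and <p> tags while building the tree.
repaired = str(soup)
print(repaired)
```

Different parsers may repair broken markup differently, so it is worth naming the parser explicitly instead of relying on the default.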

        

  • Basic usage:

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
    <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and
    <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
    and they lived at bottom of a well.</p>
    <p class="story">...</p>
    '''

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')

    print(soup.prettify())    # pretty-print the tree; missing closing tags are filled in automatically
    print(soup.title.string)  # text of the <title> tag

    Output:

    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title" name="dromouse">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were little sisters;and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        <!--Elsie-->
       </a>
       <a class="sister" href="http://example.com/lacle" id="link2">
        Lacle
       </a>
       and
       <a class="sister" href="http://example.com/tilie" id="link3">
        Tillie
       </a>
       and they lived at bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    The Dormouse's story

  1. Tag selectors

    1. Selecting elements

      html = '''
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.</p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')
      print(soup.title)
      # <title>The Dormouse's story</title>
      print(type(soup.title))
      # <class 'bs4.element.Tag'>
      print(soup.head)
      # <head><title>The Dormouse's story</title></head>
      print(soup.p)  # when several tags match, only the first one is returned
      # <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    2. Getting the tag name

      html = '''
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.</p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')
      print(soup.title.name)
      # title
    3. Getting attributes

      html = '''
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.</p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.p.attrs['name'])  # through the attrs dictionary
      # dromouse
      print(soup.p['name'])        # or by indexing the tag directly
      # dromouse
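
Indexing a tag with [] raises KeyError when the attribute is missing, which is easy to trip over on real pages. A minimal sketch of the safer .get() lookup follows; the one-line snippet is made up for illustration, and 'html.parser' is used instead of 'lxml' only so that it runs without extra installs.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <p> with a two-valued class and a name attribute.
snippet = '<p class="title story" name="dromouse">text</p>'
p = BeautifulSoup(snippet, "html.parser").p

name = p["name"]        # raises KeyError if the attribute is absent
missing = p.get("id")   # returns None instead of raising
classes = p["class"]    # class is multi-valued, so a list of strings comes back
print(name, missing, classes)
```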
    4. Getting tag content

      html = '''
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.</p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.p.string)
      # The Dormouse's story
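
.string only yields text when a tag holds a single child; once a tag has several children it returns None, and get_text() is the usual alternative. A minimal sketch under stated assumptions: the snippet is made up for illustration and 'html.parser' stands in for 'lxml' so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first <p> wraps a single <b>, the second mixes text and tags.
snippet = "<p><b>The Dormouse's story</b></p><p>Once <a>Elsie</a> and <a>Lacie</a></p>"
soup = BeautifulSoup(snippet, "html.parser")
first, second = soup.find_all("p")

single = first.string      # a single chain of only children resolves to the text
mixed = second.string      # several children, so .string is None
full = second.get_text()   # concatenates every text node under the tag
print(single, mixed, full)
```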
    5. Nested selection

      html = '''
      <html><head><title>The Dormouse's story</title></head>
      <body>
      <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
      <p class="story">Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.</p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(type(soup.title))
      # <class 'bs4.element.Tag'>
      # In the HTML, <title> is contained in <head>, so the lookups can be chained;
      # the same works for <body> followed by <p> or <a>.
      print(soup.head.title.string)
      # The Dormouse's story
    6. Children and descendants

      # get a tag's direct children
      html2 = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup
      soup2 = BeautifulSoup(html2, 'lxml')
      print(soup2.p.contents)  # .contents returns the direct children as a list

      Output:

      ['\n Once upon a time there were little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>, '\n', <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>, '\n and\n ', <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>, '\n and they lived at bottom of a well.\n ']

      Another approach:

      # get a tag's direct children
      html2 = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html2, 'lxml')

      print(soup.p.children)  # .children is an iterator, so it has to be consumed in a loop

      for i, child in enumerate(soup.p.children):
          print(i, child)

      Output:

      <list_iterator object at 0x00000208F026B400>
      0
      Once upon a time there were little sisters;and their names were

      1 <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      2

      3 <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      4
      and

      5 <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      6
      and they lived at bottom of a well.

      The difference: .children is an iterator whose items have to be pulled out with a loop, whereas .contents is simply a list.

      # get a tag's descendants
      html2 = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html2, 'lxml')

      print(soup.p.descendants)  # all descendants, at every depth; also an iterator

      for i, child in enumerate(soup.p.descendants):
          print(i, child)

      Output:

      <generator object descendants at 0x00000208F0240AF0>
      0
      Once upon a time there were little sisters;and their names were

      1 <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      2

      3 <span>Elsle</span>
      4 Elsle
      5

      6

      7 <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      8 Lacle
      9
      and

      10 <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      11 Tillie
      12
      and they lived at bottom of a well.

    7. Parent and ancestors

      # parent node
      html = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html, 'lxml')

      print(soup.a.parent)  # the tag that directly contains the first <a>

      Output:

      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      and
      <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>

      # get the ancestors
      html = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html, 'lxml')
      print(list(enumerate(soup.a.parents)))  # all ancestors (the direct parent counts too)

      Output:

      [(0, <p class="story">
      Once upon a time there were little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      and
      <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>), (1, <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      and
      <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      </body>), (2, <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      and
      <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      </body></html>), (3, <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a class="sister" href="http://example.com/elsie" id="link1">
      <span>Elsle</span>
      </a>
      <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>
      and
      <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      </body></html>)]

      The last entry is the BeautifulSoup object itself, which prints the same way as the <html> element.

    8. Siblings

      # get the preceding siblings
      html = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html, 'lxml')

      # siblings are the nodes on the same level as the current one
      print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first <a>

      Output:

      [(0, '\n Once upon a time there were little sisters;and their names were\n ')]

      # get the following siblings
      html = '''
      <html>
      <head>
      <title>The Dormouse's story</title>
      </head>
      <body>
      <p class="story">
      Once upon a time there were little sisters;and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
      <span>Elsle</span>
      </a>
      <a href="http://example.com/lacle" class="sister" id="link2">Lacle</a>
      and
      <a href="http://example.com/tilie" class="sister" id="link3">Tillie</a>
      and they lived at bottom of a well.
      </p>
      <p class="story">...</p>
      '''
      from bs4 import BeautifulSoup

      soup = BeautifulSoup(html, 'lxml')

      # siblings are the nodes on the same level as the current one
      print(list(enumerate(soup.a.next_siblings)))  # siblings after the first <a>

      Output:

      [(0, '\n'), (1, <a class="sister" href="http://example.com/lacle" id="link2">Lacle</a>), (2, '\n and\n '), (3, <a class="sister" href="http://example.com/tilie" id="link3">Tillie</a>), (4, '\n and they lived at bottom of a well.\n ')]
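
Whitespace between tags is itself a text node, which is why the sibling lists above contain strings such as '\n'. A minimal sketch of skipping straight to the next tag with find_next_sibling() follows; the two-link snippet is made up for illustration and 'html.parser' is used so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two links separated by a newline.
snippet = '<p><a id="link1">A</a>\n<a id="link2">B</a></p>'
soup = BeautifulSoup(snippet, "html.parser")
first = soup.a

raw_next = first.next_sibling             # the "\n" text node between the links
tag_next = first.find_next_sibling("a")   # the next <a> tag, text nodes skipped
print(repr(raw_next), tag_next["id"])
```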

  2. Standard selectors

    find_all(name, attrs, recursive, text, **kwargs)

      Searches the document by tag name, attributes, or text content.

    1. Searching by name

      html = '''
      <div class="panel">
      <div class="panel-heading" name="elements">
      <h4>Hello</h4>
      </div>
      <div class="panel-body">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.find_all('ul'))           # the result behaves like a list
      print(type(soup.find_all('ul')[0]))  # each item is a Tag

      Output:

      [<ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>, <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>]
      <class 'bs4.element.Tag'>

      html = '''
      <div class="panel">
      <div class="panel-heading" name="elements">
      <h4>Hello</h4>
      </div>
      <div class="panel-body">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      for ul in soup.find_all('ul'):
          print(ul.find_all('li'))  # searches can be nested level by level

      Output:

      [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
      [<li class="element">Foo</li>, <li class="element">Bar</li>]

    2. Searching by attrs

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body">
      <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.find_all(attrs={'id': 'list-1'}))
      print(soup.find_all(attrs={'name': 'elements'}))

      Output:

      [<ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>]
      [<ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>]

      Another way, using keyword arguments:

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.find_all(id='list-1'))
      print(soup.find_all(class_='element'))  # class is a Python keyword, hence the trailing underscore

      Output:

      [<ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>]
      [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

    3. Searching by text

      # text
      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body" name="elements">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      # text= is the older name of this argument; Beautiful Soup 4.4+ also accepts string=
      print(soup.find_all(text='Foo'))  # returns the matching strings, not the tags
      # ['Foo', 'Foo']
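
Text search is not limited to exact strings. A minimal sketch, assuming a made-up list and the standard-library 'html.parser', of the string= spelling of the argument combined with a regular expression and with limit:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet with three list items.
snippet = "<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

exact = soup.find_all(string="Foo")                 # exact text match
pattern = soup.find_all(string=re.compile("^Foo"))  # regular-expression match
first_two = soup.find_all("li", limit=2)            # stop after two matching tags
print(exact, pattern, first_two)
```

Passing a compiled pattern instead of a plain string also works for the name and attribute filters.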
      find(name, attrs, recursive, text, **kwargs) returns a single element, whereas find_all returns every matching element.

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body" name="elements">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.find('ul'))        # the first matching tag
      print(type(soup.find('ul')))
      print(soup.find('page'))      # nothing matches, so None is returned

      Output:

      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <class 'bs4.element.Tag'>
      None

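    4. Other methods

      As with find and find_all, each of the following comes in a form that returns every match and a form that returns a single element:

      find_parents() returns all ancestor nodes
      find_parent() returns the direct parent node
      find_next_siblings() returns all following siblings
      find_next_sibling() returns the first following sibling
      find_previous_siblings() returns all preceding siblings
      find_previous_sibling() returns the first preceding sibling
      find_all_next() returns all matching nodes after the current node
      find_next() returns the first matching node after the current node
      find_all_previous() returns all matching nodes before the current node
      find_previous() returns the first matching node before the current node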
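
The navigation methods listed above can be tried on a small tree. This is a minimal sketch, not the original author's example: the snippet and its ids are made up for illustration, and 'html.parser' is used so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one list inside a wrapper div.
snippet = '<div id="outer"><ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(snippet, "html.parser")
bar = soup.find("li", string="Bar")  # start from the middle item

parent = bar.find_parent("ul")                                   # nearest matching ancestor
ancestors = [tag.name for tag in bar.find_parents()]             # innermost ancestor first
after = [li.string for li in bar.find_next_siblings("li")]       # siblings after "Bar"
before = [li.string for li in bar.find_previous_siblings("li")]  # siblings before "Bar"
following = bar.find_next("li")                                  # next match in document order
print(parent["id"], ancestors, after, before, following.string)
```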
  3. CSS selectors (pass a CSS selector string directly to select() to make the selection)

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body" name="elements">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      print(soup.select('.panel .panel-heading'))  # a class is written with a leading '.'
      print(soup.select('ul li'))                  # select by tag name
      print(soup.select('#list-2 .element'))       # an id is written with a leading '#'
      print(type(soup.select('ul')[0]))

      Output:

      [<div class="panel-heading">
      <h4>Hello</h4>
      </div>]
      [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
      [<li class="element">Foo</li>, <li class="element">Bar</li>]
      <class 'bs4.element.Tag'>

      Another approach, selecting level by level:

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body" name="elements">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''

      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      # alternatively, soup.select('ul li') returns all <li> tags in one flat list
      for ul in soup.select('ul'):
          print(ul.select('li'))

      Output:

      [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
      [<li class="element">Foo</li>, <li class="element">Bar</li>]
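
select() always returns a list; when only the first match is wanted, select_one() returns a single tag or None. A minimal sketch under stated assumptions: the snippet (including the "Baz" item) is made up for illustration, 'html.parser' stands in for 'lxml', and the selectors shown are ordinary CSS (child combinator, chained classes, attribute selector).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the panel markup used in this section.
snippet = """
<div class="panel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_li = soup.select_one("#list-1 > li")    # first match only
small = soup.select("ul.list.list-small li")  # both classes on the same <ul>
by_attr = soup.select('ul[id="list-2"]')      # attribute selector
nothing = soup.select_one("table")            # no match, so None
print(first_li.get_text(), [li.get_text() for li in small], len(by_attr), nothing)
```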

    1. Getting attributes

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body" name="elements">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      for ul in soup.select('ul'):
          print(ul['id'])        # index the tag directly with []
          print(ul.attrs['id'])  # or go through attrs and []

      Output:

      list-1
      list-1
      list-2
      list-2

    2. Getting content

      html = '''
      <div class="panel">
      <div class="panel-heading">
      <h4>Hello</h4>
      </div>
      <div class="panel-body" name="elements">
      <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
      </ul>
      <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      </ul>
      </div>
      </div>
      '''
      from bs4 import BeautifulSoup
      soup = BeautifulSoup(html, 'lxml')

      for li in soup.select('li'):
          print(li['class'], li.get_text())  # class comes back as a list; get_text() returns the text

      Output:

      ['element'] Foo
      ['element'] Bar
      ['element'] Jay
      ['element'] Foo
      ['element'] Bar

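  • Summary

The 'lxml' parser is recommended; fall back to html.parser when necessary.

Tag selectors offer only weak filtering, but they are fast.

Use find() and find_all() to match a single result or multiple results.

If you are comfortable with CSS selectors, use select().

Remember the common ways of getting attribute values and text.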
