Quick start

The example below gives a quick feel for bs4 and a look at what it can do:

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())  # pretty-print; the parser adds the missing closing tags
    print('-------------------------')
    print(soup.title)
    print('-------------------------')
    print(soup.title.name)
    print('-------------------------')
    print(soup.title.string)  # the text inside the tag
    print('-------------------------')
    print(soup.title.parent.name)
    print('-------------------------')
    print(soup.p)
    print('-------------------------')
    print(soup.p["class"])
    print('-------------------------')
    print(soup.a)
    print('-------------------------')
    print(soup.find_all('a'))
    print('-------------------------')
    print(soup.find(id='link3'))
    print('-------------------------')

Output:

    <html>
    <head>
    <title>
    The Dormouse's story
    </title>
    </head>
    <body>
    <p class="title">
    <b>
    The Dormouse's story
    </b>
    </p>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
    </a>
    ;
    and they lived at the bottom of a well.
    </p>
    <p class="story">
    ...
    </p>
    </body>
    </html>
    -------------------------
    <title>The Dormouse's story</title>
    -------------------------
    title
    -------------------------
    The Dormouse's story
    -------------------------
    head
    -------------------------
    <p class="title"><b>The Dormouse's story</b></p>
    -------------------------
    ['title']
    -------------------------
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    -------------------------
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    -------------------------
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    -------------------------

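Basic usage

Tag selectors

Add the following code to the quick-start example:

    print(soup.title)
    print(type(soup.title))
    print(soup.head)
    print(soup.p)

Writing soup.<tag name> returns that tag together with its content.
One caveat: if the document contains several tags with that name, only the first one is returned. Above, soup.p returns just the first p tag even though the document has several.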
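A minimal sketch of the first-match behaviour, using a hypothetical two-paragraph snippet and the built-in html.parser (both are assumptions chosen to keep the example self-contained). It contrasts attribute access with find_all(), which returns every match:

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document, not the article's sample HTML.
doc = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(doc, 'html.parser')

first_p = soup.p             # attribute access: only the first <p>
all_p = soup.find_all('p')   # every <p>, in document order

print(first_p['class'])      # ['title']
print(len(all_p))            # 2
print(all_p[0] is first_p)   # True: soup.p is the first find_all() match
```

When every match is needed, find_all() is the safer choice; soup.p is a shortcut for the first one.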

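Getting the name

soup.title.name returns the name of the title tag, that is, title.

Getting attributes

    print(soup.p.attrs['name'])
    print(soup.p['name'])

Both forms return the value of the p tag's name attribute. This assumes the p tag has a name attribute, as it does in the complete example later in this article.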
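As a hedged illustration of attribute access, the sketch below uses a hypothetical one-tag snippet and html.parser. It also shows Tag.get(), which returns None (or a default) instead of raising KeyError when the attribute is missing:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the attribute values are made up for illustration.
doc = '<p class="title big" name="dromouse">text</p>'
soup = BeautifulSoup(doc, 'html.parser')
p = soup.p

name_value = p['name']           # same as p.attrs['name']
class_value = p['class']         # class is multi-valued, so a list comes back
missing = p.get('id')            # no id attribute: None instead of KeyError
fallback = p.get('id', 'no-id')  # optional default value

print(name_value)   # dromouse
print(class_value)  # ['title', 'big']
print(missing)      # None
print(fallback)     # no-id
```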

Getting the content

    print(soup.p.string)

This prints the text of the first p tag:

    The Dormouse's story
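One detail worth knowing: .string only returns text when a tag has a single child. The sketch below, on a hypothetical snippet parsed with html.parser, shows .string returning None for a tag with mixed content, while get_text() joins all the text:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> holds a <b> tag and a text node.
doc = '<p><b>Bold</b> and plain</p>'
soup = BeautifulSoup(doc, 'html.parser')

single = soup.b.string    # <b> has exactly one child, a string
mixed = soup.p.string     # <p> has two children, so .string is None
full = soup.p.get_text()  # concatenates every string inside <p>

print(single)  # Bold
print(mixed)   # None
print(full)    # Bold and plain
```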

Nested selection

Tags can be chained directly to reach a nested tag:

    print(soup.head.title.string)

Tag selectors: complete examples

Selecting elements

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print('-------------------------')
    print(type(soup.title))
    print('-------------------------')
    print(soup.head)
    print('-------------------------')
    print(soup.p)

Output:

    <title>The Dormouse's story</title>
    -------------------------
    <class 'bs4.element.Tag'>
    -------------------------
    <head><title>The Dormouse's story</title></head>
    -------------------------
    <p class="title"><b>The Dormouse's story</b></p>

The HTML contains several p tags, but only one is printed: when there are multiple matches, only the first is returned.

Getting the name

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title.name)

Output:

    title

.name returns the name of the tag itself, here title.

Getting attributes

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.attrs['name'])
    print(soup.p['name'])

Output:

    dromouse
    dromouse

Getting the content

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.string)

Output:

    The Dormouse's story

Nested selection

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.head.title.string)

Output:

    The Dormouse's story

A second example, chaining p and b:

    from bs4 import BeautifulSoup

    html = '''
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    '''
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.b.string)

Output:

    The Dormouse's story

Parsing this markup with BeautifulSoup yields a BeautifulSoup object, which can print the document with standard indentation.

The following code prints every link and then the text content:

    for link in soup.find_all('a'):
        print(link.get('href'))

    print(soup.get_text())

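Parsers

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, it uses Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.

The commonly used parsers are Python's built-in html.parser, lxml's HTML parser ('lxml'), lxml's XML parser ('xml'), and html5lib.

lxml is the recommended parser because it is more efficient. On Python 2 versions earlier than 2.7.3 and Python 3 versions earlier than 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those versions is not reliable enough.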
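One possible way to prefer lxml while still running where it is not installed is sketched below. The helper name make_soup and its preferred parameter are hypothetical; the sketch relies on Beautiful Soup raising bs4.FeatureNotFound when a requested parser is unavailable:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html.parser')):
    """Hypothetical helper: try each parser name in order, return (soup, name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError('no usable parser found')


soup, used = make_soup('<p>Hello</p>')
print(used)            # 'lxml' if installed, otherwise 'html.parser'
print(soup.p.string)   # Hello
```

Different parsers can build slightly different trees from broken markup, so falling back silently is a trade-off rather than a free win.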

Child and descendant nodes

Using contents
The following example demonstrates it:

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.contents)

Output:

    ['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']

The result is a list holding every direct child node of the p tag, both tags and text strings.
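Because contents mixes text nodes with tags, a common follow-up step is to keep only the tags. A minimal sketch, assuming a hypothetical shortened snippet and html.parser:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical shortened version of the story paragraph.
doc = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> end</p>'
soup = BeautifulSoup(doc, 'html.parser')

children = soup.p.contents  # text, <a>, text, <a>, text
tag_children = [c for c in children if isinstance(c, Tag)]
ids = [t['id'] for t in tag_children]

print(len(children))  # 5
print(ids)            # ['link1', 'link2']
```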

Using children

The children attribute returns the same child nodes of the p tag as contents. The difference is that soup.p.children is an iterator rather than a list, so the nodes have to be read by looping over it.

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)
    for child in soup.p.children:
        print(child)

Output:

    <list_iterator object at 0x000001AD777B49B0>

    Once upon a time there were three little sisters; and their names were

    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>

    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

    and

    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

    and they lived at the bottom of a well.

The same loop with enumerate, which numbers each child:

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)
    for i, child in enumerate(soup.p.children):  # enumerate numbers each child
        print(i, child)

Output:

    <list_iterator object at 0x000001AD778605F8>
    0
    Once upon a time there were three little sisters; and their names were

    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2

    3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    4
    and

    5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    6
    and they lived at the bottom of a well.

Both contents and children return direct child nodes only. To get every descendant node, use descendants, for example print(soup.p.descendants). Its result is also an iterator (a generator).
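The difference between the two is easiest to see by counting nodes. A small sketch on a hypothetical snippet without whitespace, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two <a> children, one of them wrapping a <span>.
doc = '<p><a><span>Elsie</span></a><a>Lacie</a></p>'
soup = BeautifulSoup(doc, 'html.parser')

direct = list(soup.p.children)   # the two <a> tags only
deep = list(soup.p.descendants)  # <a>, <span>, 'Elsie', <a>, 'Lacie'

print(len(direct))  # 2
print(len(deep))    # 5
```

descendants walks the tree depth-first, so each tag is followed by everything inside it before the next sibling appears.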

Getting descendant nodes

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.descendants)
    for i, child in enumerate(soup.p.descendants):
        print(i, child)

Output:

    <generator object Tag.descendants at 0x000001AD777B3A98>
    0
    Once upon a time there were three little sisters; and their names were

    1 <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    2

    3 <span>Elsie</span>
    4 Elsie
    5

    6

    7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    8 Lacie
    9
    and

    10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    11 Tillie
    12
    and they lived at the bottom of a well.

Parent and ancestor nodes

soup.a.parent returns the parent node:

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.a.parent)

Output:

    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>

list(enumerate(soup.a.parents)) collects the ancestor nodes. parents itself is a generator; wrapping it in list() produces a list that holds the parent of the a tag, then the parent of that parent, and so on up to the whole document. The last two elements therefore both print as the entire document: the second-to-last is the html tag and the last is the BeautifulSoup object itself.

Ancestor nodes:

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.parents)))

Output:

    [(0, <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>), (1, <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body>), (2, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>), (3, <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
    <span>Elsie</span>
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    </body></html>)]
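Printing whole ancestors gets long quickly. A more compact way to inspect the chain is to print only each ancestor's name, sketched here on a hypothetical minimal document parsed with html.parser. The final '[document]' entry is the name of the BeautifulSoup object itself:

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document with one link.
doc = '<html><body><p><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(doc, 'html.parser')

# Walk from the <a> tag up to the document root, keeping only tag names.
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```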

Sibling nodes

soup.a.next_siblings returns all following sibling nodes
soup.a.previous_siblings returns all preceding sibling nodes
soup.a.next_sibling returns the next sibling node
soup.a.previous_sibling returns the previous sibling node

Siblings include text nodes, such as the whitespace between tags, as the output of this example shows.

    html = """
    <html>
    <head>
    <title>The Dormouse's story</title>
    </head>
    <body>
    <p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
    <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
    and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
    """

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(list(enumerate(soup.a.next_siblings)))
    print(list(enumerate(soup.a.previous_siblings)))

Output:

    [(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
    [(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
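Since next_sibling often lands on a whitespace string rather than a tag, find_next_sibling() is a convenient way to skip to the next tag. A minimal sketch on a hypothetical two-link snippet, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the two links are separated by a newline.
doc = '''<p>
<a id="link1">Elsie</a>
<a id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(doc, 'html.parser')

first = soup.a
raw_next = first.next_sibling            # the newline text node between the tags
tag_next = first.find_next_sibling('a')  # skips text, returns the next <a>

print(repr(raw_next))  # '\n'
print(tag_next['id'])  # link2
```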

Standard selectors

find_all

find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content. (Beautiful Soup 4.4.0 renamed the text argument to string; the examples here use the older name.)

Using name

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('ul'))
    print('--------------------')
    print(type(soup.find_all('ul')[0]))

The result is returned as a list:

    [<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>, <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>]
    --------------------
    <class 'bs4.element.Tag'>

find_all can be called again on each result, so searches can be nested. Here it collects the li tags inside each ul:

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))

Output:

    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]

attrs

Example:

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))

Output:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]

Common attributes such as id and class can also be passed directly as keyword arguments:

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))

Output:

    [<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

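attrs accepts a dictionary of attribute names and values. class is a special case: because class is a reserved word in Python, it cannot be used as a keyword argument, so either write class_='element' or pass the real attribute name in a dictionary, attrs={'class': 'element'}. Common attributes such as id can be passed as keyword arguments without attrs.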
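The equivalent spellings can be checked side by side. This is a sketch on a hypothetical three-item list, parsed with html.parser; note that a class filter matches any one of a tag's classes:

```python
from bs4 import BeautifulSoup

# Hypothetical list: two items carry the "element" class, one has no class.
doc = ('<ul><li class="element">Foo</li>'
       '<li class="element other">Bar</li>'
       '<li>Jay</li></ul>')
soup = BeautifulSoup(doc, 'html.parser')

by_keyword = soup.find_all(class_='element')           # trailing underscore
by_attrs = soup.find_all(attrs={'class': 'element'})   # real attribute name
by_position = soup.find_all('li', {'class': 'element'})  # name, then attrs dict

texts = [li.get_text() for li in by_keyword]
print(texts)  # ['Foo', 'Bar']
```

All three calls return the same two li tags here; which spelling to use is mostly a matter of taste.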

text

Example:

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text='Foo'))

The result is every text string in the document equal to 'Foo' (the strings themselves, not the tags that contain them):

    ['Foo', 'Foo']
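Text searches also accept a compiled regular expression, and combining a tag name with a text filter returns the tags rather than the strings. The sketch below assumes Beautiful Soup 4.4.0 or later, where this argument is spelled string, and uses a hypothetical snippet with html.parser:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list with two items that start with "Foo".
doc = '<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>'
soup = BeautifulSoup(doc, 'html.parser')

exact = soup.find_all(string='Foo')                 # exact match on the text
pattern = soup.find_all(string=re.compile('^Foo'))  # regex match
tags = soup.find_all('li', string='Foo')            # tags whose .string is 'Foo'

print(exact)    # ['Foo']
print(pattern)  # ['Foo', 'Food']
print(tags)     # [<li>Foo</li>]
```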

find

find(name, attrs, recursive, text, **kwargs)
find returns the first matching element, or None when nothing matches.

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('ul'))
    print(type(soup.find('ul')))
    print(soup.find('page'))

Output:

    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <class 'bs4.element.Tag'>
    None

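Other methods that work the same way:

find_parents() returns all matching ancestor nodes; find_parent() returns the closest one.
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() returns all matching nodes after the current node; find_next() returns the first such match.
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such match.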
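A short sketch of this family of methods, starting from the middle item of a hypothetical three-item list (the ids are made up, and html.parser is used to avoid extra dependencies):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div wrapping a list of three items.
doc = ('<div id="box"><ul>'
       '<li id="a">Foo</li><li id="b">Bar</li><li id="c">Jay</li>'
       '</ul></div>')
soup = BeautifulSoup(doc, 'html.parser')
middle = soup.find(id='b')

parent_id = middle.find_parent('div')['id']                      # nearest <div> ancestor
after_ids = [li['id'] for li in middle.find_next_siblings('li')]
before_id = middle.find_previous_sibling('li')['id']
next_id = middle.find_next('li')['id']                           # next <li> in document order
prev_id = middle.find_previous('li')['id']                       # previous <li> in document order

print(parent_id, after_ids, before_id, next_id, prev_id)
# box ['c'] a c a
```

The sibling methods stay on the same level of the tree, while find_next() and find_previous() follow document order and can cross levels.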

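CSS selectors

Pass a CSS selector straight to select() to make a selection.
Readers with front-end experience will already know CSS; the syntax is the same.

tag1, tag2 matches every tag1 and every tag2
tag1 tag2 matches every tag2 inside a tag1
[attr] matches every tag that has the given attribute
[attr=value] matches by attribute value; for example, [target=_blank] matches every tag with target=_blank

Prefix a class name with . and an id with #; a tag name needs no prefix.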

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))
    print(type(soup.select('ul')[0]))

Output:

    [<div class="panel-heading">
    <h4>Hello</h4>
    </div>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
    [<li class="element">Foo</li>, <li class="element">Bar</li>]
    <class 'bs4.element.Tag'>
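The attribute and comma selectors listed earlier can be tried the same way. This sketch uses a hypothetical snippet with two links and html.parser, and also shows select_one(), which returns the first match or None:

```python
from bs4 import BeautifulSoup

# Hypothetical panel with two links and a heading.
doc = '''
<div class="panel">
  <a href="/home" target="_blank">Home</a>
  <a href="/docs">Docs</a>
  <h4 id="title">Hello</h4>
</div>
'''
soup = BeautifulSoup(doc, 'html.parser')

has_target = [a.get_text() for a in soup.select('[target]')]
blank = [a['href'] for a in soup.select('a[target="_blank"]')]
either = [t.get_text() for t in soup.select('h4, a')]  # results in document order
heading = soup.select_one('#title').get_text()
nothing = soup.select_one('.missing')

print(has_target)  # ['Home']
print(blank)       # ['/home']
print(either)      # ['Home', 'Docs', 'Hello']
print(heading)     # Hello
print(nothing)     # None
```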

Getting the content

get_text() returns the text content:

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for li in soup.select('li'):
        print(li.get_text())

Output:

    Foo
    Bar
    Jay
    Foo
    Bar

Getting attributes
Read an attribute with [attribute name] or attrs[attribute name]:

    html='''
    <div class="panel">
    <div class="panel-heading">
    <h4>Hello</h4>
    </div>
    <div class="panel-body">
    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    </div>
    </div>
    '''
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    for ul in soup.select('ul'):
        print(ul)
        print('---------------')
        print(ul['id'])
        print('---------------')
        print(ul.attrs['id'])
        print('***************')

Output:

    <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
    </ul>
    ---------------
    list-1
    ---------------
    list-1
    ***************
    <ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    </ul>
    ---------------
    list-2
    ---------------
    list-2
    ***************

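Summary

Use the lxml parser by default, and fall back to html.parser when necessary.
Tag selectors (soup.tag) offer weak filtering but are fast.
Use find() to match a single result and find_all() to match multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attribute values and text.

Original: https://www.cnblogs.com/zhaof/p/6930955.html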
