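Python -- Learning to Use BeautifulSoup

Using BeautifulSoup 4.3

Download and installation

# Download
http://www.crummy.com/software/BeautifulSoup/bs4/download/

# After extracting the archive, run as root:
# python setup.py install

# Finally, check in Python that the import succeeds
>>> import bs4

Basic usage:

An HTML document to practice on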

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc)
# Point 1: pretty-print the HTML with soup.prettify()
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 ...
  </p>
 </body>
</html>

# Point 2: navigate to tags in the parsed HTML
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
u"The Dormouse's story"
>>> soup.title.parent.name
'head'
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Summary:
>>> soup.title                    # the first <title> tag
>>> soup.title.name               # the tag's name --> 'title'
>>> soup.title.string             # the text inside the tag
>>> soup.title.parent.name        # the name of the <title> tag's parent
>>> soup.p['class']               # value of the class attribute of the first <p> tag
>>> soup.find_all('a')            # all <a> tags
>>> soup.find(id='link3')         # the first tag whose id is 'link3'

# Point 3: get all hyperlinks
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
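
The loop above can be wrapped in a small helper. This is a minimal sketch, assuming Python 3 and the built-in 'html.parser' backend; the function name extract_links and the sample markup are made up for this example.

```python
from bs4 import BeautifulSoup

def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only the <a> tags that actually carry an href attribute
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]

sample = '<p><a href="http://example.com/elsie">Elsie</a> <a name="x">no link</a></p>'
links = extract_links(sample)
print(links)
```

Filtering with href=True avoids a KeyError on anchors that have no href.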

# Point 4: get all the text
>>> print(soup.get_text())
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

BeautifulSoup's four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

Tag

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
>>> tag = soup.b
>>> type(tag)
<class 'bs4.element.Tag'>
>>> tag.name
'b'
>>> tag.name = 'blockquote'
>>> tag
<blockquote class="boldest">Extremely bold</blockquote>
>>> tag['class']
['boldest']
>>> tag.attrs
{'class': ['boldest']}
>>> tag['class'] = 'verybold'
>>> tag['id'] = 1
>>> tag
<blockquote class="verybold" id="1">Extremely bold</blockquote>
>>> del tag['class']
>>> del tag['id']
>>> tag
<blockquote>Extremely bold</blockquote>
>>> print(tag.get('class'))
None
>>> css_soup = BeautifulSoup('<p class="body strikeout"></p>')
>>> css_soup.p['class']
['body', 'strikeout']
>>> css_soup = BeautifulSoup('<p class="body"></p>')
>>> css_soup.p['class']
['body']
>>> id_soup = BeautifulSoup('<p id="my id"></p>')
>>> id_soup.p['id']
'my id'
>>> rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a>')
>>> rel_soup.a['rel']
['index']
>>> rel_soup.a['rel'] = ['index', 'contents']
>>> print(rel_soup.p)
<p>Back to the <a rel="index contents">homepage</a></p>

Summary:
tag['class']          # value of the tag's class attribute
tag.attrs             # all of the tag's attributes, as a dict
del tag['class']      # delete the tag's class attribute

Multi-valued attributes
>>> css_soup = BeautifulSoup('<p class="body strikeout"></p>')
>>> css_soup.p['class']
['body', 'strikeout']

>>> id_soup = BeautifulSoup('<p id="my id"></p>')
>>> id_soup.p['id']
'my id'
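Summary:
For attributes that HTML allows to hold several values (such as class and rel), BeautifulSoup returns a list; for any other attribute (such as id) it returns a single str, even if the value contains spaces.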
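
Code that handles arbitrary attributes may not know in advance whether it will get a list or a str. The sketch below, assuming Python 3, 'html.parser', and a reasonably recent bs4 release that provides Tag.get_attribute_list(), shows both shapes side by side.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")
p = soup.p

classes = p["class"]                     # multi-valued attribute: a list
tag_id = p["id"]                         # single-valued attribute: a str
id_as_list = p.get_attribute_list("id")  # always a list, whatever the attribute

print(classes, repr(tag_id), id_as_list)
```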

NavigableString -- much like a regular string

>>> tag.string
u'Extremely bold'
>>> type(tag.string)
<class 'bs4.element.NavigableString'>
>>> unicode_string = unicode(tag.string)
>>> unicode_string
u'Extremely bold'
>>> type(unicode_string)
<type 'unicode'>
>>> tag.string.replace_with('No longer bold')
u'Extremely bold'
>>> tag
<blockquote>No longer bold</blockquote>

Summary:
1. A NavigableString can be converted to a plain Unicode string with unicode() (str() in Python 3)
2. A NavigableString cannot be edited in place; to change its value, call replace_with()

The BeautifulSoup object -- the whole HTML document

>>> soup.name
u'[document]'
>>> soup
<html><body><blockquote>No longer bold</blockquote></body></html>
>>> type(soup)
<class 'bs4.BeautifulSoup'>

Comments and other special strings

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup)
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>

print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
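
Because Comment is a subclass of NavigableString, an isinstance check is enough to tell comments apart from ordinary text. A minimal sketch, assuming Python 3 and 'html.parser'; the markup is invented for the example.

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!-- nav starts here --><p>Visible text</p><!-- nav ends --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Walk every node in the tree and keep only the Comment objects
comments = [node for node in soup.descendants if isinstance(node, Comment)]
texts = [str(c).strip() for c in comments]

# get_text() leaves comments out and returns only the ordinary strings
visible = soup.get_text()

print(texts)
print(visible)
```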

Parsing HTML

Setup

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)

Simplest approach -- use the tag name

>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.body.b
<b>The Dormouse's story</b>
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Summary:
    1. soup.head          # the first <head> tag
    2. soup.body.b        # the first <b> tag under the first <body>
    3. soup.find_all('a') # all <a> tags

.contents and .children

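# Written without whitespace between the tags; any line breaks or indentation
# between them would show up in .contents as extra '\n' strings
html_doc = '<html><body><a>href1</a><a>href2</a></body></html>'

# Way 1 to get child nodes -- .contents, a list indexed as contents[0], contents[1]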
>>> soup2 = BeautifulSoup(html_doc)
>>> contents = soup2.body.contents
>>> contents[0]
<a>href1</a>
>>> contents[1]
<a>href2</a>

# Way 2 -- .children, an iterator meant for looping
>>> for child in soup2.body.children:
...     print(child)
...
<a>href1</a>
<a>href2</a>
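
With indented markup, the whitespace between tags becomes string children as well, so .contents and .children return more nodes than there are tags. The sketch below, assuming Python 3 and 'html.parser' with made-up markup, keeps only the Tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

# Every direct child, including the whitespace strings between the <li> tags
all_children = list(soup.ul.children)

# Only the children that are real tags
tag_children = [c for c in soup.ul.children if isinstance(c, Tag)]
names = [c.get_text() for c in tag_children]

print(len(all_children), names)
```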

.descendants

.children and .contents only reach direct children, while .descendants walks every descendant
>>> head_tag = soup.head
>>> head_tag.contents
[<title>The Dormouse's story</title>]
>>> for child in head_tag.descendants:
...     print(child)
...
<title>The Dormouse's story</title>
The Dormouse's story

>>> head_tag
<head><title>The Dormouse's story</title></head>

>>> len(list(soup.children))
1
>>> len(list(soup.descendants))
25

.string

>>> title_tag = soup.title
>>> title_tag
<title>The Dormouse's story</title>
>>> title_tag.string
u"The Dormouse's story"

>>> print(soup.html.string)
None

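Summary:
    1. If a tag's only child is a string, .string is that string
    2. If a tag's only child is another tag, .string is that child's .string
    3. If a tag has more than one child, it is unclear what .string should refer to, so it is None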
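
The three cases are easy to check on a small document. A minimal sketch, assuming Python 3 and 'html.parser'; the ids 'one' and 'two' exist only in this example.

```python
from bs4 import BeautifulSoup

html = "<p id='one'><b>only child</b></p><p id='two'>text <b>and tag</b></p>"
soup = BeautifulSoup(html, "html.parser")
one = soup.find(id="one")
two = soup.find(id="two")

single = one.string        # one child tag: falls through to that child's string
multiple = two.string      # several children: ambiguous, so None
combined = two.get_text()  # get_text() concatenates every string instead

print(repr(single), repr(multiple), repr(combined))
```

When a tag may hold mixed content, get_text() is usually the safer choice than .string.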

.strings and .stripped_strings

>>> for string in soup.strings:
...     print(repr(string))
...
u"The Dormouse's story"
u'\n'
u"The Dormouse's story"
u'\n'
u'Once upon a time there were three little sisters; and their names were\n'
u'Elsie'
u',\n'
u'Lacie'
u' and\n'
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'\n'
u'...'
u'\n'

>>> for string in soup.stripped_strings:
...     print(repr(string))
...
u"The Dormouse's story"
u"The Dormouse's story"
u'Once upon a time there were three little sisters; and their names were'
u'Elsie'
u','
u'Lacie'
u'and'
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'...'

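Summary:
    1. .strings yields every string inside a tag
    2. .stripped_strings skips whitespace-only strings (such as '\n') and strips leading and trailing whitespace from the rest
    3. repr() returns the printable representation of an object, which makes the '\n' characters visible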
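
A common use of .stripped_strings is to flatten a fragment into one clean line of text. A minimal sketch, assuming Python 3 and 'html.parser', with invented markup:

```python
from bs4 import BeautifulSoup

html = "<p>\n  Once upon a time\n  <a>Elsie</a>,\n  <a>Lacie</a>\n</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # keeps the line breaks and indentation
clean = list(soup.stripped_strings)  # whitespace-only strings dropped, the rest stripped
one_line = " ".join(clean)

print(raw)
print(clean)
print(one_line)
```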

.parent

# Example 1
>>> title_tag = soup.title
>>> title_tag
<title>The Dormouse's story</title>
>>> title_tag.parent
<head><title>The Dormouse's story</title></head>

# Example 2
>>> title_tag.string.parent
<title>The Dormouse's story</title>

# Example 3
>>> html_tag = soup.html
>>> type(html_tag.parent)
<class 'bs4.BeautifulSoup'>

# Example 4
>>> print(soup.parent)
None

Summary:
    1. The parent of the <html> tag is the BeautifulSoup object
    2. The BeautifulSoup object has no parent (it is the root node)

.parents

>>> link = soup.a
>>> link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> for parent in link.parents:
...     if parent is None:
...         print(parent)
...     else:
...         print(parent.name)
...
p
body
html
[document]

Sibling nodes

HTML used for the sibling examples

>>> sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
>>> print(sibling_soup.prettify())
<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>

.next_sibling and .previous_sibling

# Example 1: compare with the prettify() output above
>>> sibling_soup.b.next_sibling
<c>text2</c>
>>> sibling_soup.c.previous_sibling
<b>text1</b>

# Example 2: as the prettify() output shows, <b> has no sibling before it and <c> has none after it, so both results are None
>>> print(sibling_soup.b.previous_sibling)
None
>>> print(sibling_soup.c.next_sibling)
None

# Example 3: note that the string text1 has no sibling, because it and text2 do not share the same parent
>>> sibling_soup.b.string
u'text1'
>>> print(sibling_soup.b.string.next_sibling)
None

# Example 4: note that the next sibling of the first <a> tag is the string ',\n', not the next <a> tag, because the text and line break between the two tags form a node of their own
Here are all the <a> tags in the document first:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

>>> link = soup.a
>>> link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> link.next_sibling
u',\n'
>>> link.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

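# Example 5: confirming the point above -- with nothing between the tags, the next sibling of an <a> tag is the next <a> tag rather than a string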
>>> html_doc = '<a href="link1"></a><a href="link2"></a>'
>>> soup2 = BeautifulSoup(html_doc)
>>> link = soup2.a
>>> link
<a href="link1"></a>
>>> link.next_sibling
<a href="link2"></a>
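
When the goal is the next tag rather than the next node, find_next_sibling() skips over the text in between. A minimal sketch, assuming Python 3 and 'html.parser'; the markup is a shortened, made-up version of the practice document.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the string between the two tags
next_tag = first.find_next_sibling("a")  # the next <a> tag, text skipped

print(repr(raw_next), next_tag["id"])
```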

.next_siblings and .previous_siblings

# Example 1
>>> for sibling in soup.a.next_siblings:
...     print(repr(sibling))
...
u',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
u' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
u';\nand they lived at the bottom of a well.'

# Example 2
>>> for sibling in soup.find(id='link3').previous_siblings:
...     print(repr(sibling))
...
u' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
u',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
u'Once upon a time there were three little sisters; and their names were\n'

.next_element and .previous_element

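Background:
<html><head><title>The Dormouse's story</title></head></html>
How does an HTML parser process this markup?
It opens the <html> tag, opens the <head> tag, opens the <title> tag, stores the string "The Dormouse's story", closes the <title> tag, closes the <head> tag, and closes the <html> tag.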

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Example
>>> second_a_tag = soup.find('a', id='link2')
>>> second_a_tag
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
>>> second_a_tag.next_sibling
u' and\n'
>>> second_a_tag.next_element
u'Lacie'

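Summary:
     Right after the parser reads <a id='link2'>, the next thing it parses is the string 'Lacie', and after that u' and\n'; so .next_element is u'Lacie' while .next_sibling is u' and\n'. (Note: closing tags do not count as elements.)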
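
The difference between the two attributes can be seen on a two-link fragment. A minimal sketch, assuming Python 3 and 'html.parser', with made-up markup:

```python
from bs4 import BeautifulSoup

html = "<p><a id='link2'>Lacie</a> and <a id='link3'>Tillie</a></p>"
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a", id="link2")

nxt_sibling = a.next_sibling              # next node at the same level: the text after </a>
nxt_element = a.next_element              # next node in parse order: the string inside <a>
after_text = a.next_element.next_element  # parse order then continues with the text after </a>

print(repr(nxt_sibling), repr(nxt_element), repr(after_text))
```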

.next_elements and .previous_elements

>>> last_a_tag = soup.find('a', id='link3')
>>> for element in last_a_tag.next_elements:
...     print(repr(element))
...
u'Tillie'
u';\nand they lived at the bottom of a well.'
u'\n'
<p class="story">...</p>
u'...'
u'\n'

find() and find_all()

Basic use of find_all(). Signature: find_all(name, attrs, recursive, text, limit, **kwargs)

# Example 1: find all <title> tags
>>> soup.find_all('title')
[<title>The Dormouse's story</title>]

# Example 2: find all <p> tags whose class is title
>>> soup.find_all('p', 'title')
[<p class="title"><b>The Dormouse's story</b></p>]

# Example 3: find all <a> tags
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Example 4: find every tag whose id is link2
>>> soup.find_all(id='link2')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Example 5: find the first string that contains 'sisters' (with text= alone, find() returns the string itself, not its tag)
>>> import re
>>> soup.find(text=re.compile('sisters'))
u'Once upon a time there were three little sisters; and their names were\n'

Passing a function as the filter

>>> def has_class_but_no_id(tag):
...     return tag.has_attr('class') and not tag.has_attr('id')
...
>>> soup.find_all(has_class_but_no_id)
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

Advanced use of find_all()

# Example 1: find every tag that has an id attribute
>>> soup.find_all(id=True)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Example 2: find tags whose href contains 'elsie' and whose id is link1
>>> soup.find_all(href=re.compile('elsie'), id='link1')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# Example 3: attribute names that are not valid Python identifiers
>>> data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
>>> data_soup.find_all(data-foo='value')
  File "<stdin>", line 1
SyntaxError: keyword can't be an expression

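That does not work, because data-foo is not a valid Python keyword argument name.
Pass the attribute in the attrs dictionary instead: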
>>> data_soup.find_all(attrs={'data-foo': 'value'})
[<div data-foo="value">foo!</div>]
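
The same search can also be written by unpacking a dictionary into keyword arguments, which keeps the hyphenated name out of the source code. A minimal sketch, assuming Python 3 and 'html.parser'; the second div is added only to show that the filter is selective.

```python
from bs4 import BeautifulSoup

html = '<div data-foo="value">foo!</div><div data-foo="other">bar</div>'
soup = BeautifulSoup(html, "html.parser")

# The approach from the text: pass the attribute through attrs
by_attrs = soup.find_all(attrs={"data-foo": "value"})

# Alternative: unpack a dict, so "data-foo" is passed as a keyword at call time
by_kwargs = soup.find_all(**{"data-foo": "value"})

print([d.get_text() for d in by_attrs])
```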

Searching by CSS class

Note: class is a reserved word in Python, so use class_ instead (class with a trailing underscore)
# Example 1: find all <a> tags whose class is sister
>>> soup.find_all('a', class_='sister')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Example 2: find tags whose class contains 'itl'
>>> soup.find_all(class_=re.compile('itl'))
[<p class="title"><b>The Dormouse's story</b></p>]

# Example 3: pass a function; a tag matches when the function returns True for one of its class values
>>> def has_six_characters(css_class):
...     return css_class is not None and len(css_class) == 6
...
>>> soup.find_all(class_=has_six_characters)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Example 4: an attribute such as class can hold several values
>>> css_soup = BeautifulSoup('<p class="body strikeout"></p>')
>>> css_soup.find_all('p', class_='strikeout')
[<p class="body strikeout"></p>]
>>> css_soup.find_all('p', class_='body')
[<p class="body strikeout"></p>]

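Note: a tag with a multi-valued attribute can be found by searching for any one of its values

# Example 5: when several classes are searched for as one string, the order must match the document exactly, so the reversed order finds nothing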
>>> css_soup.find_all('p', class_='strikeout body')
[]

# Example 6: a CSS selector matches the classes regardless of order
>>> css_soup.select('p.strikeout.body')
[<p class="body strikeout"></p>]
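
The three matching behaviours can be compared in one place. A minimal sketch, assuming Python 3, 'html.parser', and a bs4 release where select() is available; the second <p> is added to show the difference between "any class" and "all classes".

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout"></p><p class="body"></p>'
soup = BeautifulSoup(html, "html.parser")

either = soup.find_all("p", class_="body")              # any tag that has the class body
reversed_order = soup.find_all("p", class_="strikeout body")  # full string in the wrong order
both = soup.select("p.strikeout.body")                  # tags that have both classes, any order

print(len(either), len(reversed_order), len(both))
```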

# Example 7: in early versions that do not support class_, use attrs={}
>>> soup.find_all('a', attrs={'class': 'sister'})
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The text argument

With text you can search for strings instead of tags. As with name and the keyword arguments, you can pass in a string, a regular expression, a list, a function, or the value True. Here are some examples:

# Example 1: pass a string
>>> soup.find_all(text='Elsie')
[u'Elsie']

# Example 2: pass a list
>>> soup.find_all(text=['Tillie', 'Elsie', 'Lacie'])
[u'Elsie', u'Lacie', u'Tillie']

# Example 3: pass a regular expression
>>> soup.find_all(text=re.compile('Dormouse'))
[u"The Dormouse's story", u"The Dormouse's story"]

# Example 4: pass a function
>>> def is_the_only_string_within_a_tag(s):
...     return (s == s.parent.string)
...
>>> soup.find_all(text=is_the_only_string_within_a_tag)
[u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

# Example 5: combine text with other arguments
>>> soup.find_all('a', text='Elsie')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
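
From Beautiful Soup 4.4.0 onward the same argument is also available under the name string, which is the spelling newer code tends to use. A minimal sketch, assuming Python 3, 'html.parser', and bs4 4.4.0 or later; the markup is invented.

```python
import re
from bs4 import BeautifulSoup

html = "<p>Three little sisters</p><a>Elsie</a><a>Lacie</a>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")                   # strings equal to 'Elsie'
by_regex = soup.find_all(string=re.compile("sisters"))  # strings matching the pattern
tags = soup.find_all("a", string="Elsie")               # <a> tags whose .string is 'Elsie'

print(exact, by_regex, tags)
```

With a tag name given, the result is a list of tags; with string alone, it is a list of strings.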

The limit argument

# On a large document a full search is slow; limit makes find_all() stop after the given number of results

# Example: return only the first two matches
>>> soup.find_all('a', limit=2)
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

The HTML being searched
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>

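# Example: <title> sits under <head>, not directly under <html>. With recursion on (the default), find_all() searches children, grandchildren, and so on; with recursive=False it searches only direct children, so <title> is not found.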
>>> soup.html.find_all('title')
[<title>The Dormouse's story</title>]
>>> soup.html.find_all('title', recursive=False)
[]
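
recursive=False is also a convenient way to list only the direct child tags of a node. A minimal sketch, assuming Python 3 and 'html.parser', with a tiny made-up document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # searches direct children only

# True matches every tag, so this lists the direct child tags of <html>
direct = [t.name for t in soup.html.find_all(True, recursive=False)]

print(len(deep), shallow, direct)
```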

Calling a tag is like calling find_all()

Calling a BeautifulSoup object or a Tag as if it were a function is the same as calling find_all() on it.

# These two lines are equivalent
soup.title.find_all(text=True)
soup.title(text=True)

find()

Signature: find(name, attrs, recursive, text, **kwargs)

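Basic use of find()

# Example 1: these two are nearly equivalent; find_all() with limit=1 returns a list holding the single result, while find() returns the result itself. Without limit, find_all() scans the whole document, which is slower.
>>> soup.find_all('title', limit=1)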
[<title>The Dormouse's story</title>]
>>> soup.find('title')
<title>The Dormouse's story</title>

# Example 2: when nothing matches, find() returns None while find_all() returns an empty list
>>> print(soup.find('nosuchtag'))
None
>>> print(soup.find_all('nosuchtag'))
[]

# Example 3: these two are equivalent
>>> soup.head.title
<title>The Dormouse's story</title>
>>> soup.find('head').find('title')
<title>The Dormouse's story</title>
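
Since find() returns None when nothing matches, chaining attribute access onto its result can raise AttributeError. A minimal sketch of a guarded lookup, assuming Python 3 and 'html.parser'; the helper name title_text and its default parameter are made up for this example.

```python
from bs4 import BeautifulSoup

def title_text(html, default=""):
    """Return the stripped <title> text, or default when there is no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("title")
    # Guard against None before touching the tag
    return tag.get_text(strip=True) if tag is not None else default

with_title = title_text("<title> Hi </title>")
without_title = title_text("<p>x</p>")
print(repr(with_title), repr(without_title))
```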

find_parents() and find_parent()

Signature: find_parents(name, attrs, text, limit, **kwargs)

Signature: find_parent(name, attrs, text, **kwargs)

>>> a_string = soup.find(text='Lacie')
>>> a_string
u'Lacie'
>>> a_string.find_parents('a')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> a_string.find_parent('p')
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
>>> a_string.find_parents('p', class_='title')
[]

find_next_siblings() and find_next_sibling()

>>> first_link = soup.a
>>> first_link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> first_link.find_next_siblings('a')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> first_story_paragraph = soup.find('p', 'story')
>>> first_story_paragraph.find_next_sibling('p')
<p class="story">...</p>

find_previous_siblings() and find_previous_sibling()

Signature: find_previous_siblings(name, attrs, text, limit, **kwargs)

Signature: find_previous_sibling(name, attrs, text, **kwargs)

>>> last_link = soup.find('a', id='link3')
>>> last_link
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> last_link.find_previous_siblings('a')
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
>>> first_story_paragraph = soup.find('p', 'story')
>>> first_story_paragraph.find_previous_sibling('p')
<p class="title"><b>The Dormouse's story</b></p>

find_all_next() and find_next()

Signature: find_all_next(name, attrs, text, limit, **kwargs)

Signature: find_next(name, attrs, text, **kwargs)

>>> first_link = soup.a
>>> first_link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> first_link.find_all_next(text=True)
[u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie', u';\nand they lived at the bottom of a well.', u'\n', u'...', u'\n']
>>> first_link.find_next('p')
<p class="story">...</p>

find_all_previous() and find_previous()

>>> first_link = soup.a
>>> first_link
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> first_link.find_all_previous('p')
[<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="title"><b>The Dormouse's story</b></p>]
>>> first_link.find_previous('title')
<title>The Dormouse's story</title>

CSS selectors: omitted here

Modifying, deleting, and adding HTML: omitted here

Beautiful Soup Documentation

http://www.crummy.com/software/BeautifulSoup/bs4/doc
