Table of Contents

I. The requests module
II. The BeautifulSoup module

I. The requests module

1. Introduction

The Python standard library provides urllib, urllib2, httplib and similar modules for making HTTP requests, but their APIs are clumsy. They were built for another era and another internet, and even the simplest task takes a great deal of work, sometimes including overriding methods.

Requests is an Apache2-licensed HTTP library written in Python. It is a high-level wrapper around the built-in modules that makes sending HTTP requests from Python far more pleasant; with Requests you can easily do almost anything a browser can do.

2. Request methods

(1) GET requests

# 1. GET without parameters

import requests

ret = requests.get('https://github.com/timeline.json')
ret.encoding = "gbk"   # force the encoding used to decode ret.text
print(ret.url)
print(ret.text)     # str
print(ret.content)  # bytes


# 2. GET with parameters

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)

print(ret.url)
print(ret.text)

(2) POST requests

# 1. Basic POST

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)

print(ret.text)


# 2. POST with custom headers and a JSON body

import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)

print(ret.text)
print(ret.cookies)

(3) Other request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the methods above are built on top of this one
requests.request(method, url, **kwargs)
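
For example, the following minimal sketch (httpbin.org used as a neutral test endpoint) shows that the convenience helpers simply forward to requests.request():

import requests

# These two calls build the same request; requests.get() just delegates to requests.request()
r1 = requests.get("http://httpbin.org/get", params={"k": "v"})
r2 = requests.request("GET", "http://httpbin.org/get", params={"k": "v"})

print(r1.url == r2.url)               # True -> http://httpbin.org/get?k=v
print(r1.status_code, r2.status_code)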

(4) Proxies

import requests

url = "http://www.baidu.com/"
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
}
proxies = {
    'http': '113.200.56.13:8010',  # a random public proxy found online; not stable
}
res = requests.get(url=url, headers=headers, proxies=proxies)
res.encoding = "utf8"
print(res.text)  # verified working at the time of the test
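
Requests also accepts a separate proxy entry per scheme, and credentials can be embedded in the proxy URL. A minimal sketch, with placeholder proxy addresses (substitute proxies you actually control):

import requests

proxies = {
    "http":  "http://10.10.1.10:3128",   # used for plain-HTTP requests
    "https": "http://10.10.1.10:1080",   # used for HTTPS requests
}

# user:password@host:port embeds proxy credentials directly in the URL
auth_proxies = {
    "https": "http://user:password@10.10.1.10:3128",
}

resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
print(resp.text)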

3. Parameter reference

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

        >>> import requests
        >>> req = requests.request('GET', 'http://httpbin.org/get')
        >>> req
        <Response [200]>
    """

Parameter list (signature and docstring of requests.request)
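
As a quick orientation before the per-parameter examples below, here is a small sketch that combines several of the keyword arguments documented above (httpbin.org as a neutral endpoint):

import requests

resp = requests.request(
    "GET",
    "https://httpbin.org/get",
    params={"q": "python"},                 # query string
    headers={"User-Agent": "demo-client"},  # request headers
    timeout=(3.05, 10),                     # (connect timeout, read timeout)
    allow_redirects=True,
    verify=True,                            # verify the TLS certificate (the default)
)
print(resp.status_code)
print(resp.json()["args"])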

Parameter details

def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # params can be:
    # - a dict
    # - a string
    # - bytes (ASCII characters only)

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params="k1=v1&k2=水电费&k3=v3&k3=vv3")

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # Error: non-ASCII characters are not allowed in bytes params
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
    pass


def param_data():
    # data can be:
    # - a dict
    # - a string
    # - bytes
    # - a file object

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1; k2=v2; k3=v3; k3=v4"
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1;k2=v2;k3=v3;k3=v4",
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file contains: k1=v1;k2=v2;k3=v3;k3=v4
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )
    pass


def param_json():
    # The dict is serialized to a JSON string with json.dumps(...),
    # sent in the request body, and the Content-Type header is set to 'application/json'
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})


def param_headers():
    # Send custom request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # Send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )
    # A CookieJar can also be used (the dict form is a wrapper around it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # Upload a file
    # file_dict = {
    #     'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload a file with a custom file name
    # file_dict = {
    #     'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload inline content with a custom file name
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload inline content with a custom file name, content type and extra headers
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    pass


def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    #                    auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)


def param_timeout():
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass


def param_allow_redirects():
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies = {
    #     "http": "61.172.249.96:80",
    #     "https": "http://61.185.219.126:3128",
    # }

    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)

    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    #     'http': '77.75.105.165',
    #     'https': '77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)

    pass


def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # process the response here
    #     for i in r.iter_content():
    #         print(i)


def requests_session():
    import requests

    session = requests.Session()

    ### 1. Visit any page first to obtain the initial cookies
    i1 = session.get(url="http://dig.chouti.com/help/service")

    ### 2. Log in, carrying the previous cookies; the backend authorizes the 'gpsd' cookie
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    ### 3. Vote on a link; the session carries the authorized cookies automatically
    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)

Parameter examples

Official documentation (Chinese): http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4
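
The examples above rely on a handful of Response attributes (text, content, encoding, url, cookies). As a quick reference, a sketch of the ones used most often in this article (httpbin.org as a neutral endpoint):

import requests

resp = requests.get("https://httpbin.org/get", params={"k": "v"})

print(resp.status_code)         # HTTP status code, e.g. 200
print(resp.url)                 # final URL, query string included
print(resp.headers["Content-Type"])
print(resp.cookies.get_dict())  # cookies set by the server ({} here)
print(resp.encoding)            # encoding used to decode resp.text
print(type(resp.text))          # <class 'str'>  -- decoded body
print(type(resp.content))       # <class 'bytes'> -- raw body
print(resp.json()["args"])      # {'k': 'v'} -- parse a JSON body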

II. The BeautifulSoup module

See also: https://blog.csdn.net/xxf813/article/details/81605197

1. Introduction

BeautifulSoup is a module that takes an HTML or XML string, parses it into a structured document, and then provides methods for quickly locating specific elements, which makes searching HTML or XML for a given element straightforward.

2. Basic usage

pip3 install beautifulsoup4

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
<div class="title">
<b>The Dormouse's story总共</b>
<h1>f</h1>
</div>
<div class="story">Once upon a time there were three little sisters; and their names were
<a class="sister0" id="link1">Els<span>f</span>ie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")
# find the first <a> tag
tag1 = soup.find(name='a')
# find all <a> tags
tag2 = soup.find_all(name='a')
# find the tag with id="link2"
tag3 = soup.select('#link2')

Basic usage

3. Methods

1. name: the tag name

# tag = soup.find('a')
# name = tag.name    # get
# print(name)
# tag.name = 'span'  # set
# print(soup)

name

2. attrs: the tag's attributes

# tag = soup.find('a')
# attrs = tag.attrs          # get
# print(attrs)
# tag.attrs = {'ik': 123}    # set (replace all attributes)
# tag.attrs['id'] = 'iiiii'  # set a single attribute
# print(soup)

attr

3. children: direct child nodes

# body = soup.find('body')
# v = body.children

children

4. descendants: all descendant nodes

# body = soup.find('body')
# v = body.descendants

descendants

5. clear: remove everything inside the tag (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

clear

6. decompose: recursively remove the tag and everything inside it

# body = soup.find('body')
# body.decompose()
# print(soup)

decompose

7. extract: like decompose, but returns the removed tag (detached from its parent)

# body = soup.find('body')
# v = body.extract()
# print(soup)

extract
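
A tiny sketch of how extract differs from decompose: the detached subtree is returned and can still be used.

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><a>link</a> text</div>", "lxml")
a = soup.find('a').extract()  # remove <a> from the tree, keep a reference to it
print(a)         # <a>link</a>
print(soup.div)  # <div> text</div> -- the <a> tag is gone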

8. decode: serialize to a string (including the current tag); decode_contents: the same, excluding the current tag

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

decode

9. encode: serialize to bytes (including the current tag); encode_contents: the same, excluding the current tag

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

encode

10. find: return the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

find

11. find_all: return all matching tags

# tags = soup.find_all('a')
# print(tags)

# tags = soup.find_all('a', limit=1)
# print(tags)

# tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags)

# ####### lists #######
# v = soup.find_all(name=['a', 'div'])
# print(v)

# v = soup.find_all(class_=['sister0', 'sister'])
# print(v)

# v = soup.find_all(text=['Tillie'])
# print(v, type(v[0]))

# v = soup.find_all(id=['link1', 'link2'])
# print(v)

# v = soup.find_all(href=['link1', 'link2'])
# print(v)

# ####### regular expressions #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v)

# rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v)

# rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v)

# ####### filter functions #######
# def func(tag):
#     return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v)

# ## get: read a single tag attribute
# tag = soup.find('a')
# v = tag.get('id')
# print(v)

find_all

12. has_attr: check whether the tag has a given attribute

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

has_attr

13. get_text: get the text content inside the tag

# tag = soup.find('a')
# v = tag.get_text()  # the optional first argument is a separator string joined between text pieces
# print(v)

get_text

14. index: get a child's index within a tag

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v)

# tag = soup.find('body')
# for i, v in enumerate(tag):
#     print(i, v)

index

15. is_empty_element: whether the tag is an empty (self-closing) element,

i.e. one of: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

is_empty_element

16. Related tags of the current tag

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings

#
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings

#
# tag.parent
# tag.parents

Related tags (next, previous, parent, ...)
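
A small self-contained sketch of what a few of these navigation attributes return (note that next_sibling may be a text node rather than a tag):

from bs4 import BeautifulSoup

doc = '<div><a id="link1">Elsie</a>, <a id="link2">Lacie</a></div>'
soup = BeautifulSoup(doc, features="lxml")

first_a = soup.find('a')
print(first_a.parent.name)          # div
print(repr(first_a.next_sibling))   # ', ' -- the text node between the two <a> tags
print(first_a.find_next_sibling())  # <a id="link2">Lacie</a> -- skips text nodes
for p in first_a.parents:           # div, body, html, [document]
    print(p.name)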

17. Searching among a tag's related tags

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...)

# tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...)

# tag.find_parent(...)
# tag.find_parents(...)

# these take the same arguments as find_all

Searching among related tags

18. select / select_one: CSS selectors

soup.select("title")

soup.select("p:nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')

from bs4.element import Tag

def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)

from bs4.element import Tag

def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)

select / select_one: CSS selectors
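
The listing above only uses select; a short sketch of the difference from select_one (select returns a list, select_one returns the first match or None):

from bs4 import BeautifulSoup

doc = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, features="lxml")

print(soup.select("p a"))        # list of both <a> tags
print(soup.select_one("p a"))    # only the first <a> tag
print(soup.select_one("#nope"))  # None when nothing matches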

19. Tag content

# tag = soup.find('span')
# print(tag.string)           # get
# tag.string = 'new content'  # set
# print(soup)

# tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup)

# tag = soup.find('body')
# v = tag.stripped_strings  # generator over the stripped text of every descendant
# print(v)

Tag content (string, stripped_strings, ...)
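
Since stripped_strings is a generator, a small sketch of materializing it:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p> Hello </p><p> world </p></body>", "lxml")
body = soup.find('body')
print(list(body.stripped_strings))  # ['Hello', 'world']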

20. append: append a tag at the end of the current tag's contents

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

append

21. insert: insert a tag at a given position inside the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

insert

22. insert_after / insert_before: insert after or before the current tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

insert_after,insert_before

23. replace_with: replace the current tag with another tag

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

replace_with

24. Creating relationships between tags

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

Creating relationships between tags

25. wrap: wrap the current tag in another tag

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup)

# tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

wrap

26. unwrap: remove the current tag but keep its contents

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)

unwrap

More options in the official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Examples

# # -*- coding:utf-8 -*-
# import requests
# from bs4 import BeautifulSoup
#
#
# response = requests.get("http://www.autohome.com.cn/news/")
# response.encoding = "gbk"
#
# # print(type(response.text))  # str
#
#
# soup = BeautifulSoup(response.text, "html.parser")
#
# tag = soup.find(id='auto-channel-lazyload-article')
# print(tag)
# tag = soup.find(name="h3")
# print(tag)
# tag = soup.find(name="h3", attrs={"class": "xxx"})

# For every news item:
# title, summary, url, image

import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.autohome.com.cn/news/")
html.encoding = "gbk"
print(type(html.text))     # <class 'str'>
print(type(html.content))  # <class 'bytes'>

soup = BeautifulSoup(html.text, "html.parser")

li_list = soup.find(id="auto-channel-lazyload-article").find_all(name="li")
for li in li_list:
    title = li.find(name="h3")
    if not title:
        continue
    print('\033[1;32m[标题]\033[0m', title.text)
    summary = li.find(name="p")
    print('\033[1;33m[简介]\033[0m', summary.text)

    # attrs = li.find("a").attrs
    # print(attrs)
    url = li.find("a").get("href")
    print('\033[1;34m[url]\033[0m', url)

    img = li.find("img").get("src")
    print('\033[1;35m[img]\033[0m', img)

    res = requests.get("http:%s" % (img,))

    # file_name = img.rsplit("/", 1)[1]
    # with open(file_name, "wb") as f:
    #     f.write(res.content)

    print("\033[1;31m==================================================================\033[0m")

autohome.com.cn news pages

# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup

# Get the CSRF token
r1 = requests.get("https://github.com/login")
r1.encoding = "utf-8"
s1 = BeautifulSoup(r1.text, "html.parser")
token = s1.find(name="input", attrs={"name": "authenticity_token"}).get("value")
r1_cookie_dict = r1.cookies.get_dict()
print(token)

# POST the username, password and token to the login URL
'''
commit:Sign in
utf8:✓
authenticity_token:r31RX8eQeShWRxUnEyYXtQHVmIlrw6sZmwdyy/IYP0dCzV1m4covQQZz+d8qUuc9mT8qIxjjx0U3YjKN9ZvLHA==
login:asdf
password:asdf
'''
r2 = requests.post(
    url="https://github.com/session",
    data={
        "utf8": "✓",
        "authenticity_token": token,
        "login": "fat39@163.com",
        "password": "123!@#qwe",
        "commit": "Sign in"
    },
    cookies=r1_cookie_dict,
)

r2_cookie_dict = r2.cookies.get_dict()
print(r2_cookie_dict)

# Visit a page that requires the login cookies
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)
r3 = requests.get(
    url="https://github.com/settings/emails",
    cookies=cookie_dict,
)

print(r3.text)

############################################
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup

# ############## Approach 1: manage cookies by hand ##############
#
# # 1. Visit the login page and grab authenticity_token
# i1 = requests.get('https://github.com/login')
# soup1 = BeautifulSoup(i1.text, features='lxml')
# tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
# authenticity_token = tag.get('value')
# c1 = i1.cookies.get_dict()
# i1.close()
#
# # 2. Send the username, password and authenticity_token to verify the user
# form_data = {
#     "authenticity_token": authenticity_token,
#     "utf8": "",
#     "commit": "Sign in",
#     "login": "wupeiqi@live.com",
#     'password': 'xxoo'
# }
#
# i2 = requests.post('https://github.com/session', data=form_data, cookies=c1)
# c2 = i2.cookies.get_dict()
# c1.update(c2)
# i3 = requests.get('https://github.com/settings/repositories', cookies=c1)
#
# soup3 = BeautifulSoup(i3.text, features='lxml')
# list_group = soup3.find(name='div', class_='listgroup')
#
# from bs4.element import Tag
#
# for child in list_group.children:
#     if isinstance(child, Tag):
#         project_tag = child.find(name='a', class_='mr-1')
#         size_tag = child.find(name='small')
#         temp = "项目:%s(%s); 项目路径:%s" % (project_tag.get('href'), size_tag.string, project_tag.string, )
#         print(temp)

# ############## Approach 2: let requests.Session manage cookies ##############
# session = requests.Session()
# # 1. Visit the login page and grab authenticity_token
# i1 = session.get('https://github.com/login')
# soup1 = BeautifulSoup(i1.text, features='lxml')
# tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
# authenticity_token = tag.get('value')
# c1 = i1.cookies.get_dict()
# i1.close()
#
# # 2. Send the username, password and authenticity_token to verify the user
# form_data = {
#     "authenticity_token": authenticity_token,
#     "utf8": "",
#     "commit": "Sign in",
#     "login": "wupeiqi@live.com",
#     'password': 'xxoo'
# }
#
# i2 = session.post('https://github.com/session', data=form_data)
# c2 = i2.cookies.get_dict()
# c1.update(c2)
# i3 = session.get('https://github.com/settings/repositories')
#
# soup3 = BeautifulSoup(i3.text, features='lxml')
# list_group = soup3.find(name='div', class_='listgroup')
#
# from bs4.element import Tag
#
# for child in list_group.children:
#     if isinstance(child, Tag):
#         project_tag = child.find(name='a', class_='mr-1')
#         size_tag = child.find(name='small')
#         temp = "项目:%s(%s); 项目路径:%s" % (project_tag.get('href'), size_tag.string, project_tag.string, )
#         print(temp)

Automated GitHub login

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time

import requests
from bs4 import BeautifulSoup

session = requests.Session()

i1 = session.get(
    url='https://www.zhihu.com/#signin',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

soup1 = BeautifulSoup(i1.text, 'lxml')
xsrf_tag = soup1.find(name='input', attrs={'name': '_xsrf'})
xsrf = xsrf_tag.get('value')

current_time = time.time()
i2 = session.get(
    url='https://www.zhihu.com/captcha.gif',
    params={'r': current_time, 'type': 'login'},
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    })

with open('zhihu.gif', 'wb') as f:
    f.write(i2.content)

captcha = input('请打开zhihu.gif文件,查看并输入验证码:')
form_data = {
    "_xsrf": xsrf,
    'password': 'xxooxxoo',
    "captcha": captcha,
    'email': '424662508@qq.com'
}
i3 = session.post(
    url='https://www.zhihu.com/login/email',
    data=form_data,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

i4 = session.get(
    url='https://www.zhihu.com/settings/profile',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
    }
)

soup4 = BeautifulSoup(i4.text, 'lxml')
tag = soup4.find(id='rename-section')
nick_name = tag.find('span', class_='name').string
print(nick_name)

Zhihu login

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
import json
import base64

import rsa
import requests


def js_encrypt(text):
    b64der = 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB'
    der = base64.standard_b64decode(b64der)

    pk = rsa.PublicKey.load_pkcs1_openssl_der(der)
    v1 = rsa.encrypt(bytes(text, 'utf8'), pk)
    value = base64.encodebytes(v1).replace(b'\n', b'')
    value = value.decode('utf8')

    return value


session = requests.Session()

i1 = session.get('https://passport.cnblogs.com/user/signin')
rep = re.compile("'VerificationToken': '(.*)'")
v = re.search(rep, i1.text)
verification_token = v.group(1)

form_data = {
    'input1': js_encrypt('wptawy'),
    'input2': js_encrypt('asdfasdf'),
    'remember': False
}

i2 = session.post(url='https://passport.cnblogs.com/user/signin',
                  data=json.dumps(form_data),
                  headers={
                      'Content-Type': 'application/json; charset=UTF-8',
                      'X-Requested-With': 'XMLHttpRequest',
                      'VerificationToken': verification_token}
                  )

i3 = session.get(url='https://i.cnblogs.com/EditDiary.aspx')

print(i3.text)

cnblogs (博客园) login

Reference / original source

http://www.cnblogs.com/wupeiqi/articles/6283017.html
