requests

Python标准库中提供了:urllib、urllib2、httplib等模块以供Http请求,但是,它的 API 太渣了。它是为另一个时代、另一个互联网所创建的。它需要巨量的工作,甚至包括各种方法覆盖,来完成最简单的任务。

Requests 是使用 Apache2 Licensed 许可证的 基于Python开发的HTTP 库,其在Python内置模块的基础上进行了高度的封装,从而使得Pythoner进行网络请求时,变得美好了许多,使用Requests可以轻而易举的完成浏览器可有的任何操作。

1、GET请求

# 1、无参数实例

import requests

ret = requests.get('https://github.com/timeline.json')

print ret.url
print ret.text # 2、有参数实例 import requests payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload) print ret.url
print ret.text

2、POST请求

# 1、基本POST实例

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload) print ret.text # 2、发送请求头和数据实例 import requests
import json url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'} ret = requests.post(url, data=json.dumps(payload), headers=headers) print ret.text
print ret.cookies

3、其他请求

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs) # 以上方法均是在此方法的基础上构建
requests.request(method, url, **kwargs)

4、更多参数

def request(method, url, **kwargs):
"""Constructs and sends a :class:`Request <Request>`. :param method: method for the new :class:`Request` object.
:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
:param json: (optional) json data to send in the body of the :class:`Request`.
:param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
:param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
:param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
to add for the file.
:param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
:param timeout: (optional) How long to wait for the server to send data
before giving up, as a float, or a :ref:`(connect timeout, read
timeout) <timeouts>` tuple.
:type timeout: float or tuple
:param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
:type allow_redirects: bool
:param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
:param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
:param stream: (optional) if ``False``, the response content will be immediately downloaded.
:param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
:return: :class:`Response <Response>` object
:rtype: requests.Response Usage:: >>> import requests
>>> req = requests.request('GET', 'http://httpbin.org/get')
<Response [200]>
"""

参数列表

def param_method_url():
# requests.request(method='get', url='http://127.0.0.1:8000/test/')
# requests.request(method='post', url='http://127.0.0.1:8000/test/')
pass def param_param():
# - 可以是字典
# - 可以是字符串
# - 可以是字节(ascii编码以内) # requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params={'k1': 'v1', 'k2': '水电费'}) # requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params="k1=v1&k2=水电费&k3=v3&k3=vv3") # requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8')) # 错误
# requests.request(method='get',
# url='http://127.0.0.1:8000/test/',
# params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
pass def param_data():
# 可以是字典
# 可以是字符串
# 可以是字节
# 可以是文件对象 # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data={'k1': 'v1', 'k2': '水电费'}) # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data="k1=v1; k2=v2; k3=v3; k3=v4"
# ) # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data="k1=v1;k2=v2;k3=v3;k3=v4",
# headers={'Content-Type': 'application/x-www-form-urlencoded'}
# ) # requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# data=open('data_file.py', mode='r', encoding='utf-8'), # 文件内容是:k1=v1;k2=v2;k3=v3;k3=v4
# headers={'Content-Type': 'application/x-www-form-urlencoded'}
# )
pass def param_json():
# 将json中对应的数据进行序列化成一个字符串,json.dumps(...)
# 然后发送到服务器端的body中,并且Content-Type是 {'Content-Type': 'application/json'}
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
json={'k1': 'v1', 'k2': '水电费'}) def param_headers():
# 发送请求头到服务器端
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
json={'k1': 'v1', 'k2': '水电费'},
headers={'Content-Type': 'application/x-www-form-urlencoded'}
) def param_cookies():
# 发送Cookie到服务器端
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
data={'k1': 'v1', 'k2': 'v2'},
cookies={'cook1': 'value1'},
)
# 也可以使用CookieJar(字典形式就是在此基础上封装)
from http.cookiejar import CookieJar
from http.cookiejar import Cookie obj = CookieJar()
obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
)
requests.request(method='POST',
url='http://127.0.0.1:8000/test/',
data={'k1': 'v1', 'k2': 'v2'},
cookies=obj) def param_files():
# 发送文件
# file_dict = {
# 'f1': open('readme', 'rb')
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) # 发送文件,定制文件名
# file_dict = {
# 'f1': ('test.txt', open('readme', 'rb'))
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) # 发送文件,定制文件名
# file_dict = {
# 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) # 发送文件,定制文件名
# file_dict = {
# 'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
# }
# requests.request(method='POST',
# url='http://127.0.0.1:8000/test/',
# files=file_dict) pass def param_auth():
from requests.auth import HTTPBasicAuth, HTTPDigestAuth ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
print(ret.text) # ret = requests.get('http://192.168.1.1',
# auth=HTTPBasicAuth('admin', 'admin'))
# ret.encoding = 'gbk'
# print(ret.text) # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
# print(ret)
# def param_timeout():
# ret = requests.get('http://google.com/', timeout=1)
# print(ret) # ret = requests.get('http://google.com/', timeout=(5, 1))
# print(ret)
pass def param_allow_redirects():
ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
print(ret.text) def param_proxies():
# proxies = {
# "http": "61.172.249.96:80",
# "https": "http://61.185.219.126:3128",
# } # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'} # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
# print(ret.headers) # from requests.auth import HTTPProxyAuth
#
# proxyDict = {
# 'http': '77.75.105.165',
# 'https': '77.75.105.165'
# }
# auth = HTTPProxyAuth('username', 'mypassword')
#
# r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
# print(r.text) pass def param_stream():
ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
print(ret.content)
ret.close() # from contextlib import closing
# with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
# # 在此处理响应。
# for i in r.iter_content():
# print(i) def requests_session():
import requests session = requests.Session() ### 1、首先登陆任何页面,获取cookie i1 = session.get(url="http://dig.chouti.com/help/service") ### 2、用户登陆,携带上一次的cookie,后台对cookie中的 gpsd 进行授权
i2 = session.post(
url="http://dig.chouti.com/login",
data={
'phone': "",
'password': "xxxxxx",
'oneMonth': ""
}
) i3 = session.post(
url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(i3.text)

参数示例

官方文档:http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4

BeautifulSoup

BeautifulSoup是一个模块,该模块用于接收一个HTML或XML字符串,然后将其进行格式化,之后遍可以使用他提供的方法进行快速查找指定元素,从而使得在HTML或XML中查找指定元素变得简单。

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
<div class="title">
<b>The Dormouse's story总共</b>
<h1>f</h1>
</div>
<div class="story">Once upon a time there were three little sisters; and their names were
<a class="sister0" id="link1">Els<span>f</span>ie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
""" soup = BeautifulSoup(html_doc, features="lxml")
# 找到第一个a标签
tag1 = soup.find(name='a')
# 找到所有的a标签
tag2 = soup.find_all(name='a')
# 找到id=link2的标签
tag3 = soup.select('#link2')

安装:

pip3 install beautifulsoup4

使用示例:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
...
</body>
</html>
""" soup = BeautifulSoup(html_doc, features="lxml")

1. name,标签名称

# tag = soup.find('a')
# name = tag.name # 获取
# print(name)
# tag.name = 'span' # 设置
# print(soup)

2. attr,标签属性

# tag = soup.find('a')
# attrs = tag.attrs # 获取
# print(attrs)
# tag.attrs = {'ik':123} # 设置
# tag.attrs['id'] = 'iiiii' # 设置
# print(soup)

3. children,所有子标签

# body = soup.find('body')
# v = body.children

4. children,所有子子孙孙标签

# body = soup.find('body')
# v = body.descendants

5. clear,将标签的所有子标签全部清空(保留标签名)

# tag = soup.find('body')
# tag.clear()
# print(soup)

6. decompose,递归的删除所有的标签

# body = soup.find('body')
# body.decompose()
# print(soup)

7. extract,递归的删除所有的标签,并获取删除的标签

# body = soup.find('body')
# v = body.extract()
# print(soup)

8. decode,转换为字符串(含当前标签);decode_contents(不含当前标签)

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)

9. encode,转换为字节(含当前标签);encode_contents(不含当前标签)

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

10. find,获取匹配的第一个标签

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

11. find_all,获取匹配的所有标签

# tags = soup.find_all('a')
# print(tags) # tags = soup.find_all('a',limit=1)
# print(tags) # tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# # tags = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tags) # ####### 列表 #######
# v = soup.find_all(name=['a','div'])
# print(v) # v = soup.find_all(class_=['sister0', 'sister'])
# print(v) # v = soup.find_all(text=['Tillie'])
# print(v, type(v[0])) # v = soup.find_all(id=['link1','link2'])
# print(v) # v = soup.find_all(href=['link1','link2'])
# print(v) # ####### 正则 #######
import re
# rep = re.compile('p')
# rep = re.compile('^p')
# v = soup.find_all(name=rep)
# print(v) # rep = re.compile('sister.*')
# v = soup.find_all(class_=rep)
# print(v) # rep = re.compile('http://www.oldboy.com/static/.*')
# v = soup.find_all(href=rep)
# print(v) # ####### 方法筛选 #######
# def func(tag):
# return tag.has_attr('class') and tag.has_attr('id')
# v = soup.find_all(name=func)
# print(v) # ## get,获取标签属性
# tag = soup.find('a')
# v = tag.get('id')
# print(v)

12. has_attr,检查标签是否具有该属性

# tag = soup.find('a')
# v = tag.has_attr('id')
# print(v)

13. get_text,获取标签内部文本内容

# tag = soup.find('a')
# v = tag.get_text('id')
# print(v)

14. index,检查标签在某标签中的索引位置

# tag = soup.find('body')
# v = tag.index(tag.find('div'))
# print(v) # tag = soup.find('body')
# for i,v in enumerate(tag):
# print(i,v)

15. is_empty_element,是否是空标签(是否可以是空)或者自闭合标签,

判断是否是如下标签:'br' , 'hr', 'input', 'img', 'meta','spacer', 'link', 'frame', 'base'

# tag = soup.find('br')
# v = tag.is_empty_element
# print(v)

16. 当前的关联标签

# soup.next
# soup.next_element
# soup.next_elements
# soup.next_sibling
# soup.next_siblings #
# tag.previous
# tag.previous_element
# tag.previous_elements
# tag.previous_sibling
# tag.previous_siblings #
# tag.parent
# tag.parents

17. 查找某标签的关联标签

# tag.find_next(...)
# tag.find_all_next(...)
# tag.find_next_sibling(...)
# tag.find_next_siblings(...) # tag.find_previous(...)
# tag.find_all_previous(...)
# tag.find_previous_sibling(...)
# tag.find_previous_siblings(...) # tag.find_parent(...)
# tag.find_parents(...) # 参数同find_all

18. select,select_one, CSS选择器

soup.select("title")

soup.select("p nth-of-type(3)")

soup.select("body a")

soup.select("html head title")

tag = soup.select("span,a")

soup.select("head > title")

soup.select("p > a")

soup.select("p > a:nth-of-type(2)")

soup.select("p > #link1")

soup.select("body > a")

soup.select("#link1 ~ .sister")

soup.select("#link1 + .sister")

soup.select(".sister")

soup.select("[class~=sister]")

soup.select("#link1")

soup.select("a#link2")

soup.select('a[href]')

soup.select('a[href="http://example.com/elsie"]')

soup.select('a[href^="http://example.com/"]')

soup.select('a[href$="tillie"]')

soup.select('a[href*=".com/el"]')

from bs4.element import Tag

def default_candidate_generator(tag):
for child in tag.descendants:
if not isinstance(child, Tag):
continue
if not child.has_attr('href'):
continue
yield child tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags) from bs4.element import Tag
def default_candidate_generator(tag):
for child in tag.descendants:
if not isinstance(child, Tag):
continue
if not child.has_attr('href'):
continue
yield child tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)
print(type(tags), tags)

19. 标签的内容

# tag = soup.find('span')
# print(tag.string) # 获取
# tag.string = 'new content' # 设置
# print(soup) # tag = soup.find('body')
# print(tag.string)
# tag.string = 'xxx'
# print(soup) # tag = soup.find('body')
# v = tag.stripped_strings # 递归内部获取所有标签的文本
# print(v)

20.append在当前标签内部追加一个标签

# tag = soup.find('body')
# tag.append(soup.find('a'))
# print(soup)
#
# from bs4.element import Tag
# obj = Tag(name='i',attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.append(obj)
# print(soup)

21.insert在当前标签内部指定位置插入一个标签

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# tag.insert(2, obj)
# print(soup)

22. insert_after,insert_before 在当前标签后面或前面插入

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('body')
# # tag.insert_before(obj)
# tag.insert_after(obj)
# print(soup)

23. replace_with 在当前标签替换为指定标签

# from bs4.element import Tag
# obj = Tag(name='i', attrs={'id': 'it'})
# obj.string = '我是一个新来的'
# tag = soup.find('div')
# tag.replace_with(obj)
# print(soup)

24. 创建标签之间的关系

# tag = soup.find('div')
# a = soup.find('a')
# tag.setup(previous_sibling=a)
# print(tag.previous_sibling)

25. wrap,将指定标签把当前标签包裹起来

# from bs4.element import Tag
# obj1 = Tag(name='div', attrs={'id': 'it'})
# obj1.string = '我是一个新来的'
#
# tag = soup.find('a')
# v = tag.wrap(obj1)
# print(soup) # tag = soup.find('a')
# v = tag.wrap(soup.find('p'))
# print(soup)

26. unwrap,去掉当前标签,将保留其包裹的标签

# tag = soup.find('a')
# v = tag.unwrap()
# print(soup)

更多参数官方:http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

一大波"自动登陆"示例

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests # ############## 方式一 ##############
"""
# ## 1、首先登陆任何页面,获取cookie
i1 = requests.get(url="http://dig.chouti.com/help/service")
i1_cookies = i1.cookies.get_dict() # ## 2、用户登陆,携带上一次的cookie,后台对cookie中的 gpsd 进行授权
i2 = requests.post(
url="http://dig.chouti.com/login",
data={
'phone': "8615131255089",
'password': "xxooxxoo",
'oneMonth': ""
},
cookies=i1_cookies
) # ## 3、点赞(只需要携带已经被授权的gpsd即可)
gpsd = i1_cookies['gpsd']
i3 = requests.post(
url="http://dig.chouti.com/link/vote?linksId=8589523",
cookies={'gpsd': gpsd}
) print(i3.text)
""" # ############## 方式二 ##############
"""
import requests session = requests.Session()
i1 = session.get(url="http://dig.chouti.com/help/service")
i2 = session.post(
url="http://dig.chouti.com/login",
data={
'phone': "8615131255089",
'password': "xxooxxoo",
'oneMonth': ""
}
)
i3 = session.post(
url="http://dig.chouti.com/link/vote?linksId=8589523"
)
print(i3.text) """

抽屉新热榜

#!/usr/bin/env python
# -*- coding:utf-8 -*- import requests
from bs4 import BeautifulSoup # ############## 方式一 ##############
#
# # 1. 访问登陆页面,获取 authenticity_token
# i1 = requests.get('https://github.com/login')
# soup1 = BeautifulSoup(i1.text, features='lxml')
# tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
# authenticity_token = tag.get('value')
# c1 = i1.cookies.get_dict()
# i1.close()
#
# # 1. 携带authenticity_token和用户名密码等信息,发送用户验证
# form_data = {
# "authenticity_token": authenticity_token,
# "utf8": "",
# "commit": "Sign in",
# "login": "wupeiqi@live.com",
# 'password': 'xxoo'
# }
#
# i2 = requests.post('https://github.com/session', data=form_data, cookies=c1)
# c2 = i2.cookies.get_dict()
# c1.update(c2)
# i3 = requests.get('https://github.com/settings/repositories', cookies=c1)
#
# soup3 = BeautifulSoup(i3.text, features='lxml')
# list_group = soup3.find(name='div', class_='listgroup')
#
# from bs4.element import Tag
#
# for child in list_group.children:
# if isinstance(child, Tag):
# project_tag = child.find(name='a', class_='mr-1')
# size_tag = child.find(name='small')
# temp = "项目:%s(%s); 项目路径:%s" % (project_tag.get('href'), size_tag.string, project_tag.string, )
# print(temp) # ############## 方式二 ##############
# session = requests.Session()
# # 1. 访问登陆页面,获取 authenticity_token
# i1 = session.get('https://github.com/login')
# soup1 = BeautifulSoup(i1.text, features='lxml')
# tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
# authenticity_token = tag.get('value')
# c1 = i1.cookies.get_dict()
# i1.close()
#
# # 1. 携带authenticity_token和用户名密码等信息,发送用户验证
# form_data = {
# "authenticity_token": authenticity_token,
# "utf8": "",
# "commit": "Sign in",
# "login": "wupeiqi@live.com",
# 'password': 'xxoo'
# }
#
# i2 = session.post('https://github.com/session', data=form_data)
# c2 = i2.cookies.get_dict()
# c1.update(c2)
# i3 = session.get('https://github.com/settings/repositories')
#
# soup3 = BeautifulSoup(i3.text, features='lxml')
# list_group = soup3.find(name='div', class_='listgroup')
#
# from bs4.element import Tag
#
# for child in list_group.children:
# if isinstance(child, Tag):
# project_tag = child.find(name='a', class_='mr-1')
# size_tag = child.find(name='small')
# temp = "项目:%s(%s); 项目路径:%s" % (project_tag.get('href'), size_tag.string, project_tag.string, )
# print(temp)

github

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import time import requests
from bs4 import BeautifulSoup session = requests.Session() i1 = session.get(
url='https://www.zhihu.com/#signin',
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
}
) soup1 = BeautifulSoup(i1.text, 'lxml')
xsrf_tag = soup1.find(name='input', attrs={'name': '_xsrf'})
xsrf = xsrf_tag.get('value') current_time = time.time()
i2 = session.get(
url='https://www.zhihu.com/captcha.gif',
params={'r': current_time, 'type': 'login'},
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
}) with open('zhihu.gif', 'wb') as f:
f.write(i2.content) captcha = input('请打开zhihu.gif文件,查看并输入验证码:')
form_data = {
"_xsrf": xsrf,
'password': 'xxooxxoo',
"captcha": 'captcha',
'email': '424662508@qq.com'
}
i3 = session.post(
url='https://www.zhihu.com/login/email',
data=form_data,
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
}
) i4 = session.get(
url='https://www.zhihu.com/settings/profile',
headers={
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
}
) soup4 = BeautifulSoup(i4.text, 'lxml')
tag = soup4.find(id='rename-section')
nick_name = tag.find('span',class_='name').string
print(nick_name)

知乎

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import re
import json
import base64 import rsa
import requests def js_encrypt(text):
b64der = 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCp0wHYbg/NOPO3nzMD3dndwS0MccuMeXCHgVlGOoYyFwLdS24Im2e7YyhB0wrUsyYf0/nhzCzBK8ZC9eCWqd0aHbdgOQT6CuFQBMjbyGYvlVYU2ZP7kG9Ft6YV6oc9ambuO7nPZh+bvXH0zDKfi02prknrScAKC0XhadTHT3Al0QIDAQAB'
der = base64.standard_b64decode(b64der) pk = rsa.PublicKey.load_pkcs1_openssl_der(der)
v1 = rsa.encrypt(bytes(text, 'utf8'), pk)
value = base64.encodebytes(v1).replace(b'\n', b'')
value = value.decode('utf8') return value session = requests.Session() i1 = session.get('https://passport.cnblogs.com/user/signin')
rep = re.compile("'VerificationToken': '(.*)'")
v = re.search(rep, i1.text)
verification_token = v.group(1) form_data = {
'input1': js_encrypt('wptawy'),
'input2': js_encrypt('asdfasdf'),
'remember': False
} i2 = session.post(url='https://passport.cnblogs.com/user/signin',
data=json.dumps(form_data),
headers={
'Content-Type': 'application/json; charset=UTF-8',
'X-Requested-With': 'XMLHttpRequest',
'VerificationToken': verification_token}
) i3 = session.get(url='https://i.cnblogs.com/EditDiary.aspx') print(i3.text)

博客园

#!/usr/bin/env python
# -*- coding:utf-8 -*- import requests # 第一步:访问登陆页,拿到X_Anti_Forge_Token,X_Anti_Forge_Code
# 1、请求url:https://passport.lagou.com/login/login.html
# 2、请求方法:GET
# 3、请求头:
# User-agent
r1 = requests.get('https://passport.lagou.com/login/login.html',
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
},
) X_Anti_Forge_Token = re.findall("X_Anti_Forge_Token = '(.*?)'", r1.text, re.S)[0]
X_Anti_Forge_Code = re.findall("X_Anti_Forge_Code = '(.*?)'", r1.text, re.S)[0]
print(X_Anti_Forge_Token, X_Anti_Forge_Code)
# print(r1.cookies.get_dict())
# 第二步:登陆
# 1、请求url:https://passport.lagou.com/login/login.json
# 2、请求方法:POST
# 3、请求头:
# cookie
# User-agent
# Referer:https://passport.lagou.com/login/login.html
# X-Anit-Forge-Code:53165984
# X-Anit-Forge-Token:3b6a2f62-80f0-428b-8efb-ef72fc100d78
# X-Requested-With:XMLHttpRequest
# 4、请求体:
# isValidate:true
# username:15131252215
# password:ab18d270d7126ea65915c50288c22c0d
# request_form_verifyCode:''
# submit:''
r2 = requests.post(
'https://passport.lagou.com/login/login.json',
headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
'Referer': 'https://passport.lagou.com/login/login.html',
'X-Anit-Forge-Code': X_Anti_Forge_Code,
'X-Anit-Forge-Token': X_Anti_Forge_Token,
'X-Requested-With': 'XMLHttpRequest'
},
data={
"isValidate": True,
'username': '',
'password': 'ab18d270d7126ea65915c50288c22c0d',
'request_form_verifyCode': '',
'submit': ''
},
cookies=r1.cookies.get_dict()
)
print(r2.text)

拉勾网

拉钩参考:猛击这里

爬虫之requests详解的更多相关文章

  1. python 3.x 爬虫基础---Urllib详解

    python 3.x 爬虫基础 python 3.x 爬虫基础---http headers详解 python 3.x 爬虫基础---Urllib详解 前言 爬虫也了解了一段时间了希望在半个月的时间内 ...

  2. 爬虫系列---selenium详解

    一 安装 pip install Selenium 二 安装驱动 chrome驱动文件:点击下载chromedriver (yueyu下载) 三 配置chromedrive的路径(仅添加环境变量即可) ...

  3. 爬虫之Scrapy详解

    性能相关 在编写爬虫时,性能的消耗主要在IO请求中,当单进程单线程模式下请求URL时必然会引起等待,从而使得请求整体变慢. import requests def fetch_async(url): ...

  4. Python-第三方库requests详解

    Requests 是用Python语言编写,基于 urllib,采用 Apache2 Licensed 开源协议的 HTTP 库.它比 urllib 更加方便,可以节约我们大量的工作,完全满足 HTT ...

  5. [转载]Python-第三方库requests详解

    Requests 是用Python语言编写,基于 urllib,采用 Apache2 Licensed 开源协议的 HTTP 库.它比 urllib 更加方便,可以节约我们大量的工作,完全满足 HTT ...

  6. python第三方库requests详解

    Requests 是用Python语言编写,基于 urllib,采用 Apache2 Licensed 开源协议的 HTTP 库.它比 urllib 更加方便,可以节约我们大量的工作,完全满足 HTT ...

  7. python爬虫scrapy项目详解(关注、持续更新)

    python爬虫scrapy项目(一) 爬取目标:腾讯招聘网站(起始url:https://hr.tencent.com/position.php?keywords=&tid=0&st ...

  8. 11.Python-第三方库requests详解(三)

    Response对象 使用requests方法后,会返回一个response对象,其存储了服务器响应的内容,如上实例中已经提到的 r.text.r.status_code……获取文本方式的响应体实例: ...

  9. 10.Python-第三方库requests详解(二)

    Requests 是用Python语言编写,基于 urllib,采用 Apache2 Licensed 开源协议的 HTTP 库.它比 urllib 更加方便,可以节约我们大量的工作,完全满足 HTT ...

随机推荐

  1. 详解ABBYY FineReader 12扫描亮度设置

    很多刚接触ABBYY FineReader 12的小伙伴可能出现过这样一个问题:在扫描过程中会显示一条消息以提示更改亮度设置.这是因为你 FineReader扫描设置中亮度未正确设置.下面小编就给小伙 ...

  2. 【转】细谈Redis和Memcached的区别

    Redis的作者Salvatore Sanfilippo曾经对这两种基于内存的数据存储系统进行过比较: Redis支持服务器端的数据操作:Redis相比Memcached来说,拥有更多的数据结构和并支 ...

  3. mysql执行SQL语句时报错:[Err] 3 - Error writing file '/tmp/MYP0G1B8' (Errcode: 28 - No space left on device)

    问题描述: 今天一同事在mysql中执行SQL语句的时候,报了/tmp空间不足的问题,报错如下: [SQL] SELECT f.prov as 字段1, MAX( CASE f.flag_name W ...

  4. nginx配置http协议和tcp协议配置文件案例

    注意 nginx 1.9版本之后才支持 tcp #user nobody;worker_processes 1; #error_log logs/error.log;#error_log logs/e ...

  5. 8 -- 深入使用Spring -- 8...2 管理Hibernate的SessionFactory

    8.8.2 管理Hibernate的SessionFactory 当通过Hibernate进行持久层访问时,必须先获得SessionFactory对象,它是单个数据库映射关系编译后的内存镜像.在大部分 ...

  6. 8 -- 深入使用Spring -- 1...两种后处理器

    8.1 两种后处理器 Spring框架提供了很好的扩展性,出了可以与各种第三方框架良好整合外,其IoC容器也允许开发者进行扩展,这种扩展甚至无须实现BeanFactor或ApplicationCont ...

  7. cocos2dx 3.0 scrollview 在android下面背景變綠色了

    在windows上面跑的是OK的,  在android下面跑的時候就變成這樣子了:

  8. GitLab 使用

    命令行界面的基本操作如下,Web界面的操作参考:https://www.cnblogs.com/pzk7788/p/10291378.html [root@localhost ~]$ gitlab-c ...

  9. Splash jsfunc() 方法

    jsfunc()方法可以直接调用 JavaScript 定义的方法,但是所调用的方法需要用双中括号包围,这相当于实现了 JavaScript 方法到 Lua 脚本的转换 function main(s ...

  10. [Command] wc

    wc 命令可以打印目标文件的换行.单词和字节数.其中换行数 = 总行数 - 1,单词数则按照空格分隔的英文单词数进行统计,也就是说连续的汉字(短语.句子)都视作一个单词. NAME wc - 打印每个 ...