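Requests is an HTTP library written in Python and released under the Apache 2.0 license. It wraps Python's built-in networking modules in a much friendlier API, which makes sending HTTP requests far more pleasant, and it covers most of the HTTP operations a browser performs.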

BeautifulSoup is a module that takes an HTML or XML string, parses it into a tree, and provides methods for quickly locating specific elements, which makes searching HTML or XML simple.

I. Installing the modules

pip3 install requests

pip3 install beautifulsoup4

II. Using requests and beautifulsoup4 together: a simple example

Fetch the title, link, and image of each news item

import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(url="https://www.autohome.com.cn/news/")
response.encoding = response.apparent_encoding  # use the encoding detected from the content so the text decodes correctly
soup = BeautifulSoup(response.text, 'html.parser')  # the built-in HTML parser; lxml is usually the better choice
target = soup.find(id="auto-channel-lazyload-article")  # find() looks up a tag by name, id, or any other attribute
li_list = target.find_all('li')  # returns a list
for li in li_list:
    a_tag = li.find('a')
    if a_tag:
        href = a_tag.attrs.get("href")  # attrs is a dict; get() reads one attribute
        title = a_tag.find("h3").text  # find() returns a Tag object; .text gives its text
        img_src = "http:" + a_tag.find("img").attrs.get('src')  # the page uses protocol-relative image URLs
        print(href)
        print(title)
        print(img_src)
        img_response = requests.get(url=img_src)
        file_name = str(uuid.uuid4()) + '.jpg'  # a unique file name for each image
        with open(file_name, 'wb') as fp:
            fp.write(img_response.content)

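Summary of usage:

(1) The requests module

response = requests.get(url)  # fetch the response object for a URL
response.apparent_encoding  # the encoding detected from the content
response.encoding  # the encoding used to decode the text; can be assigned
response.text  # the content as str
response.content  # the content as bytes
response.status_code  # the HTTP status code

(2) The beautifulsoup4 module

soup = BeautifulSoup('page source', 'html.parser')  # build the parsed document object
target = soup.find(id="auto-channel-lazyload-article")  # find a tag by attribute; returns the first match
li_list = target.find_all('li')  # find all tags with this name and return them as a list

Note: a tag can be looked up by name, by any attribute, or by both

v1 = soup.find('div')
v1 = soup.find(id='i1')
v1 = soup.find('div', id='i1')

find_all accepts the same arguments

On a tag object we can use

obj.text  # the text
obj.attrs  # the attribute dict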
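The summary above can be tried without touching the network. The following is a minimal sketch that runs the same find / find_all / text / attrs calls against an inline HTML string; the markup, the `news` id, and the `extract_items` helper are made up for illustration, and the missing-tag checks are an addition that the original example does not have.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a news list.
SAMPLE_HTML = """
<ul id="news">
  <li><a href="/news/1"><h3>First title</h3><img src="//img.example.com/1.jpg"></a></li>
  <li><a href="/news/2"><h3>Second title</h3></a></li>
  <li>no link here</li>
</ul>
"""


def extract_items(html):
    """Return (href, title, img_src) tuples; missing parts become None."""
    soup = BeautifulSoup(html, "html.parser")
    target = soup.find(id="news")
    items = []
    for li in target.find_all("li"):
        a_tag = li.find("a")
        if a_tag is None:
            continue  # skip list items without a link
        h3 = a_tag.find("h3")
        img = a_tag.find("img")
        title = h3.text if h3 is not None else None
        img_src = ("http:" + img.attrs.get("src")) if img is not None else None
        items.append((a_tag.attrs.get("href"), title, img_src))
    return items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item)
```

Checking for None before reading .text or .attrs keeps one malformed list item from aborting the whole loop.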

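III. The requests module in detail

requests exposes the convenience functions below. Each of them ends up calling request(), so the discussion focuses on how to use request().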

def get(url, params=None, **kwargs):

    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.

    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    kwargs.setdefault('allow_redirects', True)

    return request('get', url, params=params, **kwargs)

def options(url, **kwargs):

    r"""Sends an OPTIONS request.

    :param url: URL for the new :class:`Request` object.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    kwargs.setdefault('allow_redirects', True)

    return request('options', url, **kwargs)

def head(url, **kwargs):

    r"""Sends a HEAD request.

    :param url: URL for the new :class:`Request` object.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    kwargs.setdefault('allow_redirects', False)

    return request('head', url, **kwargs)

def post(url, data=None, json=None, **kwargs):

    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.

    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.

    :param json: (optional) json data to send in the body of the :class:`Request`.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    return request('post', url, data=data, json=json, **kwargs)

def put(url, data=None, **kwargs):

    r"""Sends a PUT request.

    :param url: URL for the new :class:`Request` object.

    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.

    :param json: (optional) json data to send in the body of the :class:`Request`.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    return request('put', url, data=data, **kwargs)

def patch(url, data=None, **kwargs):

    r"""Sends a PATCH request.

    :param url: URL for the new :class:`Request` object.

    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.

    :param json: (optional) json data to send in the body of the :class:`Request`.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    return request('patch', url, data=data, **kwargs)

def delete(url, **kwargs):

    r"""Sends a DELETE request.

    :param url: URL for the new :class:`Request` object.

    :param \*\*kwargs: Optional arguments that ``request`` takes.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    """

    return request('delete', url, **kwargs)

The functions above are the convenience wrappers; request() itself follows

from . import sessions

def request(method, url, **kwargs):

    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.

    :param url: URL for the new :class:`Request` object.

    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.

    :param data: (optional) Dictionary or list of tuples ``[(key, value)]`` (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.

    :param json: (optional) json data to send in the body of the :class:`Request`.

    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.

    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.

    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.

        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``

        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string

        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers

        to add for the file.

    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.

    :param timeout: (optional) How many seconds to wait for the server to send data

        before giving up, as a float, or a :ref:`(connect timeout, read

        timeout) <timeouts>` tuple.

    :type timeout: float or tuple

    :param allow_redirects: (optional) Boolean. Enable/disable GET/OPTIONS/POST/PUT/PATCH/DELETE/HEAD redirection. Defaults to ``True``.

    :type allow_redirects: bool

    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.

    :param verify: (optional) Either a boolean, in which case it controls whether we verify

            the server's TLS certificate, or a string, in which case it must be a path

            to a CA bundle to use. Defaults to ``True``.

    :param stream: (optional) if ``False``, the response content will be immediately downloaded.

    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.

    :return: :class:`Response <Response>` object

    :rtype: requests.Response

    Usage::

      >>> import requests

      >>> req = requests.request('GET', 'http://httpbin.org/get')

      <Response [200]>

    """

    # By using the 'with' statement we are sure the session is closed, thus we

    # avoid leaving sockets open which can trigger a ResourceWarning in some

    # cases, and look like a memory leak in others.

    with sessions.Session() as session:

        return session.request(method=method, url=url, **kwargs)

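Parameters:

    :param method: the HTTP method: get, post, put, patch, delete, options, or head

    :param url: the URL to send the request to

    :param params: parameters passed in the URL query string (typically with GET)
        requests.request(method='GET', url='http://xxxx.com', params={'k1': 'v1', 'k2': 'v2'})
        is automatically converted to http://xxxx.com/?k1=v1&k2=v2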

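    :param data: data sent in the request body: a dict, bytes, or a file-like object (typically with POST)
        requests.request(method='POST', url='http://xxxx.com', data={'user': 'aaaa', 'password': 'bbbb'})
        Although written as a dict, it is form-encoded on the wire as "user=aaaa&password=bbbb"

    :param json: data serialized to JSON and sent in the request body (on a Django server it arrives in request.body)
        requests.request(method='POST', url='http://xxxx.com', json={'user': 'aaaa', 'password': 'bbbb'})
        The dict is serialized to the string '{"user": "aaaa", "password": "bbbb"}' and the Content-Type header is set to application/json
        Compared with data: form encoding only carries flat key/value pairs, so nested dicts and lists do not survive it, whereas json serializes the whole structure and can carry them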

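    :param headers: the request headers
        Sites use them to block scripted access. The Chouti auto-login example later in this article, for instance, is filtered by the User-Agent header. A site can also check Referer to see which page a request came from, which is how hotlink protection works

    :param cookies: cookies; they travel to the server in the Cookie request header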

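    :param files: used to upload files with POST, given as key/value pairs
        requests.post(url='xxx', files={
            'f1': open('s1.py', 'rb'),  # name: file object or file content, e.g. 'f1': 'dawfwafawfawf'
            'f2': ('newf1name', open('s1.py', 'rb')),  # the first tuple item is the file name reported to the server
        })

    :param auth: HTTP authentication. With Basic auth, the user name and password are base64-encoded (encoded, not encrypted) and sent in the Authorization request header
        from requests.auth import HTTPBasicAuth
        ret = requests.get('https://api.github.com/user',
                           auth=HTTPBasicAuth('username', 'password'))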

    :param timeout: a float or a tuple. A single float is the number of seconds to wait for the server to send data. A tuple has two values: the first is the connect timeout and the second is the read timeout
        ret = requests.get('http://google.com/', timeout=1)
        ret = requests.get('http://google.com/', timeout=(5, 1))

    :param allow_redirects: a boolean, True by default. When enabled, redirects are followed and the final page is returned
        requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)

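    :param proxies: proxies. A site may limit how many operations one public (non-LAN) IP address can perform. Sending the requests from different public IP addresses gets around such a limit
        In practice this is done with a proxy server: we send the request to the proxy, and the proxy forwards it to the target site from its own IP address
        requests.post(
            url="http://dig.chouti.com/log",
            data=form_data,
            proxies={
                'http': 'http://proxy-host:port',
                'https': 'http://proxy-host:port',
            }
        )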

    :param stream: download the body as a stream. Data is read in small pieces that can be written to disk as they arrive, so a large file does not have to fit in memory
        from contextlib import closing
        with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
            # handle the response here
            for i in r.iter_content():
                print(i)

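    :param cert: a client-side certificate for HTTPS. Plain HTTP sends data over the socket unencrypted, which is not secure; HTTPS encrypts the channel, and that relies on certificates

        One case is a custom certificate that the client has to supply itself:
            requests.get(
                url="https:...",
                cert="xxx.pem",  # sent with every request; either a single .pem file or a ('.crt', '.key') pair
            )
        The other case is a certificate issued by an authority that the operating system already trusts. It has to be purchased, and the site is then verified directly with no extra setup

    :param verify: a boolean (or a path to a CA bundle). When False, the server's TLS certificate is not verified and the request still returns a result. This is convenient for sites with self-signed certificates, but it removes the protection that verification provides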
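The difference between params, data, and json is easiest to see on a prepared request, which shows exactly what would go on the wire without sending anything. This is a minimal sketch; example.com and the field names are placeholders.

```python
import json

import requests

# params: encoded into the query string of the URL
get_req = requests.Request(
    "GET", "http://example.com/", params={"k1": "v1", "k2": "v2"}
).prepare()

# data: a dict is form-encoded into the request body
form_req = requests.Request(
    "POST", "http://example.com/", data={"user": "aaaa", "password": "bbbb"}
).prepare()

# json: the object is serialized to JSON, so nested structures survive
json_req = requests.Request(
    "POST", "http://example.com/", json={"user": "aaaa", "tags": ["x", "y"]}
).prepare()

print(get_req.url)
print(form_req.headers["Content-Type"], form_req.body)
print(json_req.headers["Content-Type"], json.loads(json_req.body))
```

Inspecting `prepare()` output like this is also a cheap way to debug a request that a server rejects.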

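Addendum: Session in the requests module

In the auto-login examples, cookies and other data produced during a session have to be managed by hand. A Session object instead stores the cookies returned by each response and sends them automatically with every later request made through the same session.

Note: the request headers still have to be configured by ourselves.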
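How a Session carries state can be checked offline by preparing a request through it. In this sketch the cookie is set by hand as a stand-in for one that a real response would have delivered; the header value, cookie value, and example.com are all placeholders.

```python
import requests

session = requests.Session()

# Default headers set once on the session are merged into every request.
session.headers.update({"User-Agent": "demo-agent/1.0"})

# Stand-in for a cookie that an earlier response would have stored.
session.cookies.set("gpsd", "abc123", domain="example.com", path="/")

# prepare_request() merges session headers and cookies without sending anything.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/"))

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

Setting headers on the session once avoids passing headers=headers to every call.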

import requests

headers = {}

headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'

headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'

session = requests.session()

i1 = session.get("https://dig.chouti.com/",headers=headers)

i1.close()

form_data = {

    'phone':"xxxx",

    'password':"xxxx",

    'oneMonth':''

}

i2 = session.post(url="https://dig.chouti.com/login",data=form_data,headers=headers)

i3 = session.post("https://dig.chouti.com/link/vote?linksId=20324146",headers=headers)

print(i3.text)

{"result":{"code":"", "message":"推荐成功", "data":{"jid":"cdu_52941024478","likedTime":"","lvCount":"","nick":"山上有风景","uvCount":"","voteTime":"小于1分钟前"}}}

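Addendum: the request object in Django (not the requests module above)

Recommended reading: the Django request object

Whatever format we send, the raw data always ends up in request.body, while request.POST may be empty.
Django decides how to interpret the body from the Content-Type request header.

For example: Content-Type: text/html; charset=utf-8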

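Common media types:

    text/html: HTML

    text/plain: plain text

    text/xml: XML

    image/gif: GIF image

    image/jpeg: JPEG image

    image/png: PNG image

Media types that start with application:

    application/xhtml+xml: XHTML

    application/xml: XML data

    application/atom+xml: Atom XML feed

    application/json: JSON data

    application/pdf: PDF

    application/msword: Word document

    application/octet-stream: binary stream (for example an ordinary file download)

    application/x-www-form-urlencoded: the default enctype of <form enctype="">; the form data is encoded as key/value pairs and sent to the server (the default format for form submission)

Another common media type is used for file uploads:

    multipart/form-data: required when a form uploads files

These are the Content-Type values that come up most often in everyday development.

For example, when data is sent with POST, the server receives the request body and stores it in request.body. It then checks the Content-Type request header, and if it is application/x-www-form-urlencoded, the body is parsed and the result is stored in request.POST.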
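The dispatch on Content-Type described above can be sketched with the standard library alone. This is a simplified stand-in, not Django's actual code; `parse_body` is a hypothetical helper, and it handles only two media types.

```python
import json
from urllib.parse import parse_qs


def parse_body(content_type, body):
    """Hypothetical helper: decode a raw request body based on its Content-Type."""
    # Drop parameters such as "; charset=utf-8" and normalize the case.
    media_type = content_type.split(";")[0].strip().lower()
    text = body.decode("utf-8")
    if media_type == "application/x-www-form-urlencoded":
        # key=value&key=value pairs, as sent by a default HTML form
        return {key: values[0] for key, values in parse_qs(text).items()}
    if media_type == "application/json":
        return json.loads(text)
    return None  # any other type is left to the caller as raw bytes


form = parse_body("application/x-www-form-urlencoded", b"user=aaaa&password=bbbb")
data = parse_body("application/json; charset=utf-8", b'{"user": "aaaa"}')
print(form)
print(data)
```

This mirrors why a JSON POST leaves request.POST empty: the body is still there, but nothing form-decodes it.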

IV. The beautifulsoup4 module in detail

Working with tags

Sample HTML

from bs4 import BeautifulSoup

html = '''

<html lang="en">

<head>

    <meta charset="UTF-8">

    <title>Title</title>

</head>

<body>

    <a href="/wwewe/fafwaw" class="btn btn2">666daw6fw</a>

    <div id="content" for=''>

        <p>div>p

            <label>title</label>

        </p>

    </div>

    <hr/>

    <p id="bott">div,p</p>

</body>

</html>

'''

soup = BeautifulSoup(html,features="lxml")

1. name: the tag name

tag = soup.find("a")
print(tag.name)  # a
tag = soup.find(id="content")
print(tag.name)  # div

2. attrs: working with tag attributes

tag = soup.find('a')
print(tag.attrs)  # {'href': '/wwewe/fafwaw', 'class': ['btn', 'btn2']}
print(tag.attrs['href'])  # /wwewe/fafwaw
tag.attrs['id'] = "btn-primary"  # add
del tag.attrs['class']  # delete
tag.attrs['href'] = "/change"  # modify
print(tag.attrs)  # {'href': '/change', 'id': 'btn-primary'}

3. children: all direct children

from bs4.element import Tag

body = soup.find("body")
print(body.children)  # an iterator over direct children only; deeper descendants stay nested inside those children
for child in body.children:
    # Each child is either a bs4.element.NavigableString (text, usually newlines and spaces)
    # or a bs4.element.Tag (a child element)
    if isinstance(child, Tag):
        print(child)

4. descendants: all descendants

body = soup.find("body")
for child in body.descendants:  # nested descendants are yielded as separate items as well
    # Each item is either a bs4.element.NavigableString (text, usually newlines and spaces)
    # or a bs4.element.Tag (an element)
    if isinstance(child, Tag):
        print(child)
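A small comparison makes the difference between children and descendants concrete. This is a minimal sketch on made-up markup, using the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup('<div id="c"><p>a<b>x</b></p><hr/></div>', "html.parser")
div = soup.find(id="c")

# children: only the nodes directly inside <div>
child_names = [node.name for node in div.children if isinstance(node, Tag)]

# descendants: every nested node, in document order
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```

The `<b>` tag appears only in the second list because it sits inside `<p>`, one level below the direct children.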

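5. clear: recursively remove everything inside the tag, keeping the tag itself

body = soup.find("body")
body.clear()  # empty the tag but keep it
print(soup)  # the body tag is still there, with nothing inside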

6. decompose: recursively remove the tag, including itself

body = soup.find('body')
body.decompose()  # removes the tag and everything inside it
print(soup)  # the body tag is gone

7. extract: recursively remove the tag (like decompose) and return what was removed

body = soup.find('body')
deltag = body.extract()  # removes the tag, itself included
print(soup)  # no body tag
print(deltag)  # everything that was removed
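The three removal methods are easy to confuse, so here is a minimal side-by-side sketch on made-up markup: clear() empties a tag, decompose() destroys it, and extract() detaches it and hands it back.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p id="a">x</p><p id="b">y</p><p id="c">z</p></div>', "html.parser"
)

soup.find(id="a").clear()            # keeps <p id="a"> but removes its contents
soup.find(id="b").decompose()        # removes <p id="b"> entirely
removed = soup.find(id="c").extract()  # removes <p id="c"> and returns it

remaining = str(soup)
print(remaining)
print(removed)
```

Use extract() when the removed subtree is still needed, for example to insert it somewhere else.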

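8. decode: convert to a string (including the current tag); decode_contents (excluding the current tag)

# Output as a string; printing the tag directly also works because it implements __str__
body = soup.find('body')
v = body.decode()  # includes the current tag
print(v)
v = body.decode_contents()  # excludes the current tag
print(v)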

9. encode: convert to bytes (including the current tag); encode_contents (excluding the current tag)

# Convert to bytes
body = soup.find('body')
v = body.encode()  # includes body
print(v)
v = body.encode_contents()  # excludes body
print(v)

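10. find: search by tag name, attributes, or text; recursive controls whether the search descends

tag = soup.find(name="p")  # searches all descendants recursively by default
print(tag)  # the first <p> among the descendants
tag = soup.find(name='p', recursive=False)
print(tag)  # None, because the only direct child of soup is <html>, not <body>
tag = soup.find('body').find('p')
print(tag)  # the first <p> among the descendants of <body>
tag = soup.find('body').find('p', recursive=False)
print(tag)  # <p id="bott">div,p</p>
tag = soup.find('body').find('div', attrs={"id": "content", "for": ""}, recursive=False)
print(tag)  # finds <div id="content" for="">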

11. find_all: search by tag name, attributes, text, regular expression, or function, with limit and recursive

tags = soup.find_all('p')
print(tags)
tags = soup.find_all('p', limit=1)  # fetch only one match; the result is still a list
print(tags)
tags = soup.find_all('p', attrs={'id': "bott"})  # search by attribute
print(tags)
tags = soup.find_all(name=['p', 'a'])  # all p and a tags
print(tags)
tags = soup.find("body").find_all(name=['p', 'a'], recursive=False)  # all p and a tags, direct children only
print(tags)
tags = soup.find("body").find_all(name=['p', 'a'], string="div,p")  # tags whose text is "div,p" (string= was called text= before bs4 4.4.0)
print(tags)

Matching with regular expressions:

import re

pat = re.compile("p")  # every tag whose name contains "p"
tags = soup.find_all(name=pat)
print(tags)
pat = re.compile("^lab")  # every tag whose name starts with "lab"
tags = soup.find_all(name=pat)
print(tags)
pat = re.compile(".*faf.*")
tags = soup.find_all(attrs={"href": pat})  # or simply href=pat
print(tags)
pat = re.compile("cont.*")
tags = soup.find_all(id=pat)
print(tags)

Matching with a function:

def func(tag):
    return tag.has_attr("class") and tag.has_attr("href")

tags = soup.find_all(name=func)
print(tags)
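The filter types above can be checked against a compact copy of the sample document. This is a minimal sketch: the markup is a trimmed-down version of the sample HTML, html.parser is used so lxml is not required, and `has_class_and_href` is an illustrative name.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<body><a href="/wwewe/fafwaw" class="btn btn2">link</a>'
    '<div id="content"><p>text<label>title</label></p></div>'
    '<p id="bott">div,p</p></body>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class_and_href(tag):
    """Function filter: called once per tag, keeps tags that return True."""
    return tag.has_attr("class") and tag.has_attr("href")


by_regex = [t.name for t in soup.find_all(re.compile("^lab"))]
by_attr_regex = [t.name for t in soup.find_all(href=re.compile("faf"))]
by_func = [t.name for t in soup.find_all(has_class_and_href)]
limited = soup.find_all("p", limit=1)
by_string = soup.find_all("p", string="div,p")

print(by_regex, by_attr_regex, by_func, len(limited), by_string)
```

The string filter skips the first `<p>` because that tag has more than one child, so its .string is None.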

12. get: read a tag attribute; has_attr: check whether it exists

tag = soup.find('a')
print(tag.get("href"))  # read the attribute
print(tag.attrs.get("href"))  # same, through the attrs dict
print(tag.has_attr("href"))

13. get_text and string: read the tag text; assign to string to change it

tag = soup.find(id='content')
print(tag.get_text())  # the text of the tag, including the text of all descendants
tag = soup.find("label")
print(tag.get_text())  # title
print(tag.string)  # title
tag.string = "test"
print(tag.get_text())  # test

14. index: the position of a tag within its parent

body = soup.find("body")
child_tag = body.find("div", recursive=False)
if child_tag:
    print(body.index(child_tag))  # must be a direct child, not a deeper descendant

15. is_empty_element: whether the tag is an empty (self-closing) element

tag = soup.find('hr')
print(tag.is_empty_element)  # True for an empty, self-closing tag such as <hr/>

16. Nodes related to the current tag

tag.next  # alias of next_element
tag.next_element
tag.next_elements  # includes text nodes as well as tags
tag.next_sibling  # the next node at the same level; may be a text node
tag.next_siblings
tag.previous
tag.previous_element
tag.previous_elements
tag.previous_sibling
tag.previous_siblings
tag.parent
tag.parents

tag = soup.find(id="content")
print(tag)
print(tag.next)  # the next element in document order; here a newline
print(tag.next_element)  # same as above
print(tag.next_elements)  # a generator over every following node, nested descendants included
for ele in tag.next_elements:
    print(ele)
print(tag.next_sibling)  # the next node at the same level only (here a newline text node)
print(tag.next_siblings)  # a generator over the following nodes at the same level only
for ele in tag.next_siblings:
    print(ele)

17. find_* methods: search the related nodes of the current tag by condition; usage mirrors the attributes above

tag.find_next(...)
tag.find_all_next(...)
tag.find_next_sibling(...)
tag.find_next_siblings(...)
tag.find_previous(...)
tag.find_all_previous(...)
tag.find_previous_sibling(...)
tag.find_previous_siblings(...)
tag.find_parent(...)
tag.find_parents(...)

tag = soup.find("label")
# print(tag.parent)
# for par in tag.parents:
#     print(par)
print(tag.find_parent(id='content'))  # walk upwards and return the first ancestor that matches
print(tag.find_parents(id='content'))  # walk upwards and return every ancestor that matches, as a list

18. select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")
soup.select("#link1 + .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

from bs4.element import Tag

# Note: _candidate_generator is a private argument of older Beautiful Soup releases.
# Versions that delegate CSS selection to soupsieve (4.7.0 and later) may not accept it.
def default_candidate_generator(tag):
    for child in tag.descendants:
        if not isinstance(child, Tag):
            continue
        if not child.has_attr('href'):
            continue
        yield child

tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator)
print(type(tags), tags)
tags = soup.find('body').select("a", _candidate_generator=default_candidate_generator, limit=1)  # the original limit value was lost; 1 is a placeholder
print(type(tags), tags)
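Most of the selectors listed above refer to ids and classes such as #link1 and .sister that the sample HTML in this article does not contain. The sketch below runs a few of them against a small made-up document that does contain them; it assumes a Beautiful Soup version with soupsieve-based CSS support.

```python
from bs4 import BeautifulSoup

SISTERS_HTML = """
<html><head><title>Sisters</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(SISTERS_HTML, "html.parser")

second = soup.select_one("p > a:nth-of-type(2)")                  # second <a> child of <p>
later_siblings = [a["id"] for a in soup.select("#link1 ~ .sister")]  # all following siblings
adjacent = [a["id"] for a in soup.select("#link1 + .sister")]        # only the very next sibling
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]      # href ends with "tillie"
first_only = soup.select(".sister", limit=1)                         # stop after one match

print(second["id"], later_siblings, adjacent, ends_with, len(first_only))
```

select() always returns a list, while select_one() returns a single tag or None.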

19. Tag: create a new tag

from bs4.element import Tag

tag_obj = Tag(name='pre', attrs={"col": "1"})  # the original attribute value was lost; "1" is a placeholder
tag_obj.string = "This is a new tag"
print(tag_obj)  # <pre col="1">This is a new tag</pre>

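20. append: add a tag inside another tag, as its last child (note that appending a tag that already exists in the tree moves it; it no longer exists at its old position)

soup = BeautifulSoup(html, features="lxml")

from bs4.element import Tag

tag_obj = Tag(name='pre', attrs={"col": "1"})  # the original attribute value was lost; "1" is a placeholder
tag_obj.string = "This is a new tag"
# print(tag_obj)  # <pre col="1">This is a new tag</pre>
soup.find(id="content").append(tag_obj)  # added as the last child
print(soup)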

21. insert: insert a tag at a given position inside the current tag

soup = BeautifulSoup(html, features="lxml")

from bs4.element import Tag

tag_obj = Tag(name='pre', attrs={"col": "1"})  # the original attribute value was lost; "1" is a placeholder
tag_obj.string = "This is a new tag"
# print(tag_obj)  # <pre col="1">This is a new tag</pre>
soup.find(id="content").insert(0, tag_obj)  # position 0 puts it first
print(soup)

22. insert_after, insert_before: insert after or before the current tag

soup = BeautifulSoup(html, features="lxml")

from bs4.element import Tag

tag_obj = Tag(name='pre', attrs={"col": "1"})  # the original attribute value was lost; "1" is a placeholder
tag_obj.string = "This is a new tag"
# print(tag_obj)  # <pre col="1">This is a new tag</pre>
soup.find(id="content").insert_before(tag_obj)  # placed before the current tag
soup.find(id="content").insert_after(tag_obj)  # a tag lives in one place only, so this moves it after the current tag
print(soup)

23. replace_with: replace the current tag with another tag

soup = BeautifulSoup(html, features="lxml")

from bs4.element import Tag

tag_obj = Tag(name='pre', attrs={"col": "1"})  # the original attribute value was lost; "1" is a placeholder
tag_obj.string = "This is a new tag"
# print(tag_obj)  # <pre col="1">This is a new tag</pre>
soup.find(id="content").replace_with(tag_obj)  # the original div is replaced
print(soup)

24. setup: set the relationships between tags by hand (rarely useful)

# Signature: setup(self, parent=None, previous_element=None, next_element=None,
#                  previous_sibling=None, next_sibling=None)
soup = BeautifulSoup(html, features="lxml")
div = soup.find('div')
a = soup.find('a')
div.setup(next_sibling=a)
print(soup)  # unchanged
print(div.next_sibling)  # the tag we set

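25. wrap: wrap the current tag in the given tag (the tag that calls wrap is the one that gets wrapped)

soup = BeautifulSoup(html, features="lxml")

from bs4.element import Tag

tag_obj = Tag(name='pre', attrs={"col": "1"})  # the original attribute value was lost; "1" is a placeholder
tag_obj.string = "This is a new tag"
a = soup.find("a")
a.wrap(tag_obj)  # wrap the <a> tag in the new tag
div = soup.find('div')
tag_obj.wrap(div)  # wrap tag_obj in the existing div; it becomes the div's last child
print(soup)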

26. unwrap: remove the current tag but keep what it contains

div = soup.find('div')
div.unwrap()  # the outer div is removed; its contents stay in place
print(soup)
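The tree-editing methods from the last few sections can be combined in one short run. This is a minimal sketch on made-up markup; it creates tags with soup.new_tag(), the factory method on the BeautifulSoup object, rather than calling the Tag class directly.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="content"><p>hi</p></div>', "html.parser")
div = soup.find(id="content")

# insert: put a new tag at position 0 inside the div
new_tag = soup.new_tag("pre")
new_tag.string = "new"
div.insert(0, new_tag)
after_insert = str(soup)

# wrap: the <p> that calls wrap() ends up inside the new <section>
soup.find("p").wrap(soup.new_tag("section"))
after_wrap = str(soup)

# unwrap: remove the <section> but keep the <p> it contained
soup.find("section").unwrap()
after_unwrap = str(soup)

print(after_insert)
print(after_wrap)
print(after_unwrap)
```

After unwrap() the document is back to the state it had right after the insert.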

V. Logging in to GitHub automatically

import requests
from bs4 import BeautifulSoup

html1 = requests.get(url="https://github.com/login")  # load the login page first to get the token and cookies
html1.encoding = html1.apparent_encoding
soup = BeautifulSoup(html1.text, features="html.parser")
login_token_obj = soup.find(name='input', attrs={'name': 'authenticity_token'})
login_token = login_token_obj.get("value")  # the token embedded in the page
cookie_dict = html1.cookies.get_dict()
html1.close()

# The data the login form expects
login_data = {
    'login': "username",
    'password': "password",
    'authenticity_token': login_token,
    "utf8": "",
    "commit": "Sign in"
}

session_response = requests.post("https://github.com/session", data=login_data, cookies=cookie_dict)  # the cookies must be sent
cookie_dict.update(session_response.cookies.get_dict())  # merge in the cookies set at login
index_response = requests.get("https://github.com/settings/repositories", cookies=cookie_dict)  # the cookies must be sent
soup2 = BeautifulSoup(index_response.text, features="html.parser")  # parse the list to get each repository's name and size
item_list = soup2.find_all("div", {'class': 'listgroup-item'})
for item in item_list:
    a_obj = item.find("a")
    s_obj = item.find('small')
    print(a_obj.text)
    print(s_obj.text)
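The token-extraction step is the part of this flow most likely to break when the page changes, so it is worth isolating. The sketch below runs it against a made-up login form; the markup, the token value, and the `build_login_data` helper are assumptions for illustration and do not reflect GitHub's current page.

```python
from bs4 import BeautifulSoup

# Hypothetical login page; a real page would come from response.text.
LOGIN_PAGE = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="token-123">
  <input type="text" name="login">
  <input type="password" name="password">
</form>
"""


def build_login_data(page_html, username, password):
    """Read the hidden token from the form and build the POST payload."""
    soup = BeautifulSoup(page_html, "html.parser")
    token_input = soup.find("input", attrs={"name": "authenticity_token"})
    if token_input is None:
        # Fail loudly instead of posting a form the server will reject.
        raise ValueError("authenticity_token not found in the login page")
    return {
        "login": username,
        "password": password,
        "authenticity_token": token_input.get("value"),
        "commit": "Sign in",
    }


login_data = build_login_data(LOGIN_PAGE, "user", "secret")
print(login_data)
```

Raising an explicit error when the token is missing gives a clearer failure than the AttributeError that calling .get() on None would produce.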

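VI. Logging in to Chouti (dig.chouti.com) automatically and voting

Recommended reading: Why so many sites cannot be scraped, and six common ways for crawlers to get past blocking

1. Chouti protects itself against direct scraping by validating the request headers, so we need to set the headers ourselves to avoid being blocked by the site's firewall

2. Chouti authorizes the gpsd value in the cookies returned by the first request. Later operations must send that first gpsd; using the cookie from any other request causes an error

import requests

headers = {}  # the request headers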

headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'

headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'

i1 = requests.get("https://dig.chouti.com/",headers=headers)

i1_cookie = i1.cookies.get_dict()

print(i1_cookie)

i1.close()

form_data = {

    'phone':"xxxx",

    'password':"xxxx",

    'oneMonth':''

}

headers['Accept'] = '*/*'

i2 = requests.post(url="https://dig.chouti.com/login",headers=headers,data=form_data,cookies=i1_cookie)

i2_cookie = i2.cookies.get_dict()

i2_cookie.update(i1_cookie)

i3 = requests.post("https://dig.chouti.com/link/vote?linksId=20306326",headers=headers,cookies=i2_cookie)

print(i3.text)

{'JSESSIONID': 'aaaoJAuXMtUytb02Uw9pw', 'route': '0c5178ac241ad1c9437c2aafd89a0e50', 'gpsd': '91e20c26ddac51c60ce4ca8910fb5669'}

{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_52941024478","likedTime":"1529420936883000","lvCount":"23","nick":"山上有风景","uvCount":"2","voteTime":"小于1分钟前"}}}

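VII. Logging in to Zhihu automatically

Simulating login to the new Zhihu with Python (full code)

Simulating login to the redesigned Zhihu (detailed walkthrough)

Simulating login after Zhihu switched to a REST API (covers the signature)

VIII. Logging in to cnblogs automatically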
