A simple BeautifulSoup application: extracting specific data from HTML (api_name/api_method/api_path, request body/request header/param parameters)

from bs4 import BeautifulSoup

import re

import os.path

import itertools

name='newcrm'

source_file_path='./'+name+'.html'

def get_apiInfo():

    with open(source_file_path,encoding='utf-8') as api_file:

        fileInfo=api_file.read()

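        soup = BeautifulSoup(fileInfo,'lxml')  # if lxml is not installed, pass 'html.parser' (the built-in parser) instead

        # Match div tags whose class attribute contains "api-one"

        divList = soup.find_all('div', attrs={'class': re.compile("api-one")})

        # print(len(divList))

        apiInfo_list=[]  # collects one dict per API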

        try:

            for alltag in divList:

                h3List=alltag.find_all('h3')  # all h3 tags under this div (they hold the API name)

                pList = alltag.find_all('p')  # all p tags under this div (they hold the request URL and method)

                tableList=alltag.find_all("table")  # all table tags under this div (header/body fields)

                dictInfo={}

                # Interface description info: name / description / type

                api_descride_name_list,api_descride_example_list,api_descride_type_list=[],[],[]

                # Query parameter info: name / type / required / description / example (mainly for GET requests)

                query_Param_name_list,query_Param_type_list,query_Param_IsNeed_list=[],[],[]

                query_Param_describe_list,query_Param_example_list=[],[]

                # Body parameter info: name / type / required / description / example (mainly for POST/PUT/DELETE requests)

                body_Param_name_list,body_Param_type_list,body_Param_IsNeed_list=[],[],[]

                body_Param_describe_list,body_Param_example_list=[],[]

                # Header parameter info: name / type / required / description / example

                header_Param_name_list,header_Param_type_list,header_Param_IsNeed_list=[],[],[]

                header_Param_describe_list,header_Param_example_list=[],[]

                # if divList.index(alltag)==len(divList):

                if len(pList)>=3:  # skip this div when the request method / request URL paragraphs are missing

                    api_name=h3List[0].string.strip()  # API name from the first h3 tag

                    api_method=pList[1].string.strip()  # request method

                    api_path=pList[2].string.strip()  # request URL, stripped of surrounding whitespace

                    dictInfo['api_name']=api_name

                    dictInfo['api_method']=api_method

                    dictInfo['api_path']=api_path

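                    header_list=[]  # flat list of header table values

                    body_list=[]  # flat list of body table values

                    query_list=[]  # flat list of query table values

                    api_descride_list=[]  # flat list of all interface description values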

                    # print(api_name)

                    # print(api_method)

                    # print(api_path)

                    IsEmpty_str='Null_Str'  # placeholder string for cells that have no text

                    if not tableList in([],None):

                        for table in tableList:

                            table_title=table.tr.th.span.string  # table caption (text of the first header cell)

                            table_thead=table.find_all('thead')  # all thead tags in the table: column titles

                            table_tbody=table.find_all('tbody')  # tbody tags holding the field values

                            tbody_temp_list1,tbody_temp_list2=[],[]  # temporary storage for the interface description data

                            for tbody in table_tbody:

                                tbody_key_type=tbody.find_all('th')  # th cells: field types

                                tbody_key_field=tbody.find_all('td')  # td cells: field values

                                for tbody_name in tbody_key_field :  # walk every td cell

                                    if not tbody_name.string is None:

                                        tbody_key=tbody_name.string.strip()

                                    else:

                                        tbody_name=IsEmpty_str  # placeholder for an empty cell

                                        tbody_key=tbody_name

                                    # print(tbody_name)

                                    # Compare titles with ==: `in` against a plain string is a substring test,

                                    # and '参数名' is a substring of every other table title.

                                    if table_title=='参数名':

                                        # interface description table: two td cells per row

                                        if len(tbody_key_field)%2==0:

                                            tbody_temp_list1.append(tbody_key)

                                    elif len(tbody_key_field)%5==0:

                                        # parameter tables (Query/Body/Header) have five td cells per row

                                        if table_title=='Query参数名':

                                            query_list.append(tbody_key)

                                        elif table_title=='Body参数名':

                                            body_list.append(tbody_key)

                                        elif table_title=='Header参数名':

                                            header_list.append(tbody_key)

                                # print(header_list)

                                # print(body_list)

                                # print(tbody_temp_list1)

                                if not tbody_key_type==[]:

                                    for tbody_type in tbody_key_type:

                                        if not isinstance(tbody_type,(str,int)):  # only handle Tag objects

                                            if not (tbody_type.string is None) :

                                                tbody_field=tbody_type.string.strip()

                                                tbody_temp_list2.append(tbody_field)

                                            else:

                                                tbody_type=IsEmpty_str  # placeholder for an empty cell

                                                tbody_temp_list2.append(tbody_type)

                                    # print(tbody_temp_list2)

                                    temp_num=len(tbody_temp_list1)//len(tbody_temp_list2)

                                    # Ratio between the two lists (2:1 here): each row contributes

                                    # two values from tbody_temp_list1 and one value from tbody_temp_list2.

                                    for index,type_value in enumerate(tbody_temp_list2):

                                        # The type sits in its own th tag while the other values sit in td tags,

                                        # so the two lists are interleaved into api_descride_list row by row.

                                        for num in range(temp_num):

                                            api_descride_list.append(tbody_temp_list1[index*temp_num+num])

                                        api_descride_list.append(type_value)

                                    # print(header_list)

                                    # print(body_list)

                                    # print(api_descride_list)

                                    #

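                    if not header_list in([],None):  # clean up empty placeholder fields in the header list

                        for n,m in enumerate(header_list):

                        # Remove the elements of non-existent fields; they come from a bug in the apizza export.

                        # The list can contain duplicate values, so list.index() cannot be used to find positions;

                        # enumerate yields each value together with its index instead.

                            if m in("是","否") :  # locate index n from the value m

                                try:

                                    if header_list[(n+3)]==IsEmpty_str:

                                # If the third element after n is the empty placeholder rather than a real field, delete the five elements starting at n+3

                                        for i in range(5):

                                            del header_list[n+3]

                                except IndexError as error:

                                    pass

                        for header_data in enumerate(header_list):  # distribute the flat values into the per-column lists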

                            if header_data[0]==0 or header_data[0]%5==0 :

                                header_Param_name_list.append(header_data[1])

                            elif header_data[0]==1 or header_data[0]%5==1 :

                                header_Param_type_list.append(header_data[1])

                            elif header_data[0]==2 or header_data[0]%5==2:

                                header_Param_IsNeed_list.append(header_data[1])

                            elif header_data[0]==3 or header_data[0]%5==3:

                                header_Param_describe_list.append(header_data[1])

                            elif header_data[0]==4 or header_data[0]%5==4:

                                header_Param_example_list.append(header_data[1])

                    if not query_list in([],None):

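                        for n,m in enumerate(query_list):  # clean up empty placeholder fields in the query list

                        # The list can contain duplicate values, so list.index() cannot be used to find positions;

                        # enumerate yields each value together with its index instead.

                            if m in("是","否") :  # locate index n from the value m

                                try:

                                    if query_list[(n+3)]==IsEmpty_str:

                                # If the third element after n is the empty placeholder rather than a real field,

                                # delete the five elements starting at n+3

                                        for i in range(5):

                                            del query_list[n+3]

                                except IndexError as error:

                                    pass

                        for query_data in enumerate(query_list):  # distribute the flat values into the per-column lists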

                            if query_data[0]==0 or query_data[0]%5==0 :

                                query_Param_name_list.append(query_data[1])

                            elif query_data[0]==1 or query_data[0]%5==1 :

                                query_Param_type_list.append(query_data[1])

                            elif query_data[0]==2 or query_data[0]%5==2:

                                query_Param_IsNeed_list.append(query_data[1])

                            elif query_data[0]==3 or query_data[0]%5==3:

                                query_Param_describe_list.append(query_data[1])

                            elif query_data[0]==4 or query_data[0]%5==4:

                                query_Param_example_list.append(query_data[1])

                    if not body_list in([],None):

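                        for n,m in enumerate(body_list):  # clean up empty placeholder fields in the body list

                        # The list can contain duplicate values, so list.index() cannot be used to find positions;

                        # enumerate yields each value together with its index instead.

                            if m in("是","否") :  # locate index n from the value m

                                try:

                                    if body_list[(n+3)]==IsEmpty_str:

                                # If the third element after n is the empty placeholder rather than a real field,

                                # delete the five elements starting at n+3

                                        for i in range(5):

                                            del body_list[n+3]

                                except IndexError as error:

                                    pass

                        for body_data in enumerate(body_list):  # distribute the flat values into the per-column lists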

                            if body_data[0]==0 or body_data[0]%5==0 :

                                body_Param_name_list.append(body_data[1])

                            elif body_data[0]==1 or body_data[0]%5==1 :

                                body_Param_type_list.append(body_data[1])

                            elif body_data[0]==2 or body_data[0]%5==2:

                                body_Param_IsNeed_list.append(body_data[1])

                            elif body_data[0]==3 or body_data[0]%5==3:

                                body_Param_describe_list.append(body_data[1])

                            elif body_data[0]==4 or body_data[0]%5==4:

                                body_Param_example_list.append(body_data[1])

                    if not api_descride_list in([],None):  # split the interface description list when it is not empty

                        for api_descride_data in enumerate(api_descride_list):  # distribute the flat values into the per-column lists

                            if api_descride_data[0]==0 or api_descride_data[0]%3==0 :

                                api_descride_name_list.append(api_descride_data[1])

                            elif api_descride_data[0]==1 or api_descride_data[0]%3==1 :

                                api_descride_example_list.append(api_descride_data[1])

                            elif api_descride_data[0]==2 or api_descride_data[0]%3==2:

                                api_descride_type_list.append(api_descride_data[1])

                    if not header_list in(None,[]):

                        dictInfo['Header']={}

                        dictInfo['Header']['参数名']=header_Param_name_list

                        dictInfo['Header']['类型']=header_Param_type_list

                        dictInfo['Header']['必需']=header_Param_IsNeed_list

                        dictInfo['Header']['描述']=header_Param_describe_list

                        dictInfo['Header']['示例']=header_Param_example_list

                    if not query_list in(None,[]):

                        dictInfo['Query']={}

                        dictInfo['Query']['参数名']=query_Param_name_list

                        dictInfo['Query']['类型']=query_Param_type_list

                        dictInfo['Query']['必需']=query_Param_IsNeed_list

                        dictInfo['Query']['描述']=query_Param_describe_list

                        dictInfo['Query']['示例']=query_Param_example_list

                    if not body_list in(None,[]):

                        dictInfo['Body']={}

                        dictInfo['Body']['参数名']=body_Param_name_list

                        dictInfo['Body']['类型']=body_Param_type_list

                        dictInfo['Body']['必需']=body_Param_IsNeed_list

                        dictInfo['Body']['描述']=body_Param_describe_list

                        dictInfo['Body']['示例']=body_Param_example_list

                    if not api_descride_list in(None,[]):

                        dictInfo['接口说明']={}

                        dictInfo['接口说明']['参数名']=api_descride_name_list

                        dictInfo['接口说明']['描述']=api_descride_example_list

                        dictInfo['接口说明']['类型']=api_descride_type_list

                    # dictInfo['header_list']=header_list

                    # dictInfo['body_list']=body_list

                    # dictInfo['api_descride_list']=api_descride_list

                    apiInfo_list.append(dictInfo)

        except Exception as error:

            # pass

            raise

        return apiInfo_list
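
The if/elif chains in get_apiInfo that distribute header_list, query_list and body_list by `index % 5` can also be expressed with slice steps. The following is only a minimal sketch under that assumption; `split_columns` and `COLUMNS` are hypothetical names that do not appear in the original script, and the column order is taken from the dictionary keys the script writes.

```python
from typing import Dict, List, Sequence

# Column order assumed from the keys used in dictInfo['Header'] / ['Query'] / ['Body'].
COLUMNS = ("参数名", "类型", "必需", "描述", "示例")


def split_columns(flat_cells: Sequence[str],
                  columns: Sequence[str] = COLUMNS) -> Dict[str, List[str]]:
    """Split a flat, row-major list of table cells into one list per column.

    flat_cells[idx::width] picks every element whose index % width == idx,
    which is the same selection the modulo chain performs.
    """
    width = len(columns)
    return {name: list(flat_cells[idx::width]) for idx, name in enumerate(columns)}


if __name__ == "__main__":
    # Two made-up parameter rows, flattened the way header_list would be.
    cells = ["token", "string", "是", "login token", "abc",
             "page", "int", "否", "Null_Str", "1"]
    print(split_columns(cells))
```

With such a helper, `dictInfo['Header'] = split_columns(header_list)` would replace both the modulo chain and the five separate assignments, assuming the cleaned list really is a multiple of five cells long.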

def write_apiInfo_to_file():

    source_data=get_apiInfo()  # fetch the parsed API data

    print(len(source_data))

    file_path='./'+name+'-apiInfo.txt'

    with open(file_path,'w',encoding='utf-8') as file:

        file.write(str(source_data))

write_apiInfo_to_file()
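
write_apiInfo_to_file stores `str(source_data)`, which is a Python repr and cannot be read back with a JSON parser. If the result needs to be loaded again later, writing JSON may be more convenient. This is a sketch only; `dump_api_info` and `load_api_info` are hypothetical helpers, not part of the original script.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def dump_api_info(api_info: List[Dict[str, Any]], target_path: str) -> Path:
    """Write the parsed API list as UTF-8 JSON and return the path written."""
    path = Path(target_path)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Chinese keys such as '参数名' readable in the file.
        json.dump(api_info, handle, ensure_ascii=False, indent=2)
    return path


def load_api_info(source_path: str) -> List[Dict[str, Any]]:
    """Read back a file produced by dump_api_info."""
    with Path(source_path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

A call such as `dump_api_info(get_apiInfo(), './' + name + '-apiInfo.json')` would take the place of the `file.write(str(source_data))` step.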
