FMM和BMM的python代码实现

FMM和BMM的编程实现，其实两个算法思路都挺简单，一个是从前取最大词长度的小分句，查找字典是否有该词，若无则分句去掉最后面一个字，再次查找，直至分句变成单词或者在字典中找到，并将其去除，然后重复上述步骤。BMM则是从后取分句，字典中不存在则分句最前去掉一个字，也是重复类似的步骤。

readCorpus.py

import sys

output = {}

with open('语料库.txt', mode='r', encoding='UTF-8') as f:

    for line in f.readlines():

        if line is not None:

            # 去除每行的换行符

            t_line = line.strip('\n')

            # 按空格分开每个词

            words = t_line.split(' ')

            for word in words:

                # 按/分开标记和词

                t_word = word.split('/')

                # 左方括号去除

                tf_word = t_word[0].split('[')

                if len(tf_word) == 2:

                    f_word = tf_word[1]

                else:

                    f_word = t_word[0]

                # 若在输出字典中，则value+1

                if f_word in output.keys():

                    output[f_word] = output[f_word]+1

                # 不在输出字典中则新建

                else:

                    output[f_word] = 1

            big_word1 = t_line.split('[')

            for i in range(1, len(big_word1)):

                big_word2 = big_word1[i].split(']')[0]

                words = big_word2.split(' ')

                big_word = ""

                for word in words:

                    # 按/分开标记和词

                    t_word = word.split('/')

                    big_word = big_word + t_word[0]

                # 若在输出字典中，则value+1

                if big_word in output.keys():

                    output[big_word] = output[big_word]+1

                # 不在输出字典中则新建

                else:

                    output[big_word] = 1

f.close()

with open('output.txt', mode='w', encoding='UTF-8') as f:

    while output:

        minNum = sys.maxsize

        minName = ""

        for key, values in output.items():

            if values < minNum:

                minNum = values

                minName = key

        f.write(minName+": "+str(minNum)+"\n")

        del output[minName]

f.close()

BMM.py

MAX_WORD = 19

word_list = []

ans_word = []

with open('output.txt', mode='r', encoding='UTF-8')as f:

    for line in f.readlines():

        if line is not None:

            word = line.split(':')

            word_list.append(word[0])

f.close()

#num = input("输入句子个数：")

#for i in range(int(num)):

while True:

    ans_word = []

    try:

        origin_sentence = input("输入：\n")

        while len(origin_sentence) != 0:

            len_word = MAX_WORD

            while len_word > 0:

                # 从后读取最大词长度的数据，若该数据在字典中，则存入数组，并将其去除

                if origin_sentence[-len_word:] in word_list:

                    ans_word.append(origin_sentence[-len_word:])

                    len_sentence = len(origin_sentence)

                    origin_sentence = origin_sentence[0:len_sentence-len_word]

                    break

                # 不在词典中，则从后取词长度-1

                else:

                    len_word = len_word - 1

            # 单词直接存入数组

            if len_word == 0:

                if origin_sentence[-1:] != ' ':

                    ans_word.append(origin_sentence[-1:])

                len_sentence = len(origin_sentence)

                origin_sentence = origin_sentence[0:len_sentence - 1]

        for j in range(len(ans_word)-1, -1, -1):

            print(ans_word[j] + '/', end='')

        print('\n')

    except (KeyboardInterrupt, EOFError):

        break

FMM.py

MAX_WORD = 19

word_list = []

with open('output.txt', mode='r', encoding='UTF-8')as f:

    for line in f.readlines():

        if line is not None:

            word = line.split(':')

            word_list.append(word[0])

f.close()

#num = input("输入句子个数：")

#for i in range(int(num)):

while True:

    try:

        origin_sentence = input("输入：\n")

        while len(origin_sentence) != 0:

            len_word = MAX_WORD

            while len_word > 0:

                # 读取前最大词长度数据，在数组中则输出，并将其去除

                if origin_sentence[0:len_word] in word_list:

                    print(origin_sentence[0:len_word]+'/', end='')

                    origin_sentence = origin_sentence[len_word:]

                    break

                # 不在字典中，则读取长度-1

                else:

                    len_word = len_word - 1

            # 为0则表示为单词，输出

            if len_word == 0:

                if origin_sentence[0] != ' ':

                    print(origin_sentence[0]+'/', end='')

                origin_sentence = origin_sentence[1:]

        print('\n')

    except (KeyboardInterrupt, EOFError):

        break

效果图

BMM.py（不含大粒度分词）

BMM.py（含大粒度分词）

FMM.py（不含大粒度分词）

FMM.py（含大粒度分词）

我们可以观察到含大粒度分词的情况将香港科技大学，北京航空航天大学等表意能力强的词分在了一起而不是拆开，更符合分词要求。

FMM和BMM的python代码实现的更多相关文章

可爱的豆子——使用Beans思想让Python代码更易维护
title: 可爱的豆子--使用Beans思想让Python代码更易维护 toc: false comments: true date: 2016-06-19 21:43:33 tags: [Pyth ...
if __name__== "__main__" 的意思(作用)python代码复用
if __name__== "__main__" 的意思(作用)python代码复用转自:大步's Blog http://www.dabu.info/if-__-name__ ...
Python 代码风格
1 原则在开始讨论Python社区所采用的具体标准或是由其他人推荐的建议之前,考虑一些总体原则非常重要. 请记住可读性标准的目标是提升可读性.这些规则存在的目的就是为了帮助人读写代码,而不是相反. ...
一行python代码实现树结构
树结构是一种抽象数据类型,在计算机科学领域有着非常广泛的应用.一颗树可以简单的表示为根, 左子树, 右子树. 而左子树和右子树又可以有自己的子树.这似乎是一种比较复杂的数据结构,那么真的能像我们在标题 ...
[Dynamic Language] 用Sphinx自动生成python代码注释文档
用Sphinx自动生成python代码注释文档 pip install -U sphinx 安装好了之后,对Python代码的文档,一般使用sphinx-apidoc来自动生成:查看帮助mac-abe ...
上传自己的Python代码到PyPI
一.需要准备的事情 1.当然是自己的Python代码包了: 2.注册PyPI的一个账号. 二.详细介绍 1.代码包的结构: application \application __init__.py m ...
如何在batch脚本中嵌入python代码
老板叫我帮他测一个命令在windows下消耗的时间,因为没有装windows那个啥工具包,没有timeit那个命令,于是想自己写一个,原理很简单: REM timeit.bat echo %TIME% ...
ROS系统python代码测试之rostest
ROS系统中提供了测试框架,可以实现python/c++代码的单元测试,python和C++通过不同的方式实现, 之后的两篇文档分别详细介绍各自的实现步骤,以及测试结果和覆盖率的获取. ROS系统中p ...
让计算机崩溃的python代码，求共同分析
在现在的异常机制处理的比较完善的编码系统里面,让计算机完全崩溃无法操作的代码还是不多的.今天就无意运行到这段python代码,运行完,计算机直接崩溃,任务管理器都无法调用,任何键都用不了,只能强行电源 ...

随机推荐

linux命令行翻页
在linux上面执行命令,若命令太多屏幕显示不完,通过Shift+pageup/pageDown来查看. putty连接linux后执行就不存在这个问题.
nginx rewrite不支持if 嵌套也不支持逻辑或和逻辑并
如题,apache的rewrite是支持或者的,用个OR就可以,如果不加OR,多个RewriteCond 罗列累加就是并且的意思.然后nginx的rewrite就没有这么好了.那么如何去实现这样复杂的 ...
Android源码解析系列
转载请标明出处:一片枫叶的专栏知乎上看了一篇非常不错的博文:有没有必要阅读Android源码看完之后痛定思过,平时所学往往是知其然然不知其所以然,所以为了更好的深入Android体系,决定学习an ...
新人补钙系列教程之：AS3事件处理－－事件流
一个flash应用程序可能会非常复杂,比如,有很多可视实例嵌套在一起,这样的话会形成一个树形结构,这个结构的根是stage,然后一级级到不同的实例,一般来说,要把这个树形结构倒过来看,即stage在顶 ...
5.全局异常捕捉【从零开始学Spring Boot】
在一个项目中的异常我们我们都会统一进行处理的,那么如何进行统一进行处理呢? 新建一个类GlobalDefaultExceptionHandler, 在class注解上@ControllerAdvice ...
Android源代码下载
清华大学AOSP镜像: https://mirrors.tuna.tsinghua.edu.cn/help/AOSP/
Angular 学习笔记——ng-disable
<!DOCTYPE html> <html lang="en" ng-app="myApp"> <head> <met ...
Unity3D的三种坐标系
来自:http://blog.csdn.net/luxiaoyu_sdc/article/details/13168497 1, World Space(世界坐标): 我们在场景中添加物体(如:Cub ...
如何在Django1.6结合Python3.3版本中使用MySql
用起了Python3.4跟Django1.6,数据库依然是互联网企业常见的MySql. 悲催的是在Python2.7时代连接MySql的MySQLdb还不支持Python3.4,还好,苦苦追问G哥终于 ...
Eclipse 使用 SVN 插件后改动用户方法汇总
判定 SVN 插件是哪个 JavaH 的处理方法 SVNKit 的处理方法工具自带改动功能删除缓存的秘钥文件其他发表地点判定 SVN 插件是哪个常见的 Eclipse SVN 插件我知道的一 ...

FMM和BMM的python代码实现

FMM和BMM的python代码实现

FMM和BMM的python代码实现的更多相关文章

随机推荐

热门专题