利用BLEU进行机器翻译检测(Python-NLTK-BLEU评分方法)
双语评估替换分数(简称BLEU)是一种对生成语句进行评估的指标。完美匹配的得分为1.0,而完全不匹配则得分为0.0。这种评分标准是为了评估自动机器翻译系统的预测结果而开发的,具备了以下一些优点:
- 计算速度快,计算成本低。
- 容易理解。
- 与具体语言无关。
- 已被广泛采用。
BLEU评分是由Kishore Papineni等人在他们2002年的论文BLEU a Method for Automatic Evaluation of Machine Translation中提出的。BLEU计算的原理是计算待评价译文和一个或多个参考译文间的距离。距离是文本间n元相似度的平均,n=1,2,3(更高的值似乎无关紧要)。也就是说,如果待选译文和参考译文的2元(连续词对)或3元相似度较高,那么该译文的得分就较高。
我们是翻译众包业务,对于我们的应用场景,如何得知译员是否有参考机器翻译引擎就成了一个比较重要的问题。我提出的基本思路是:
- 在多个翻译网站上翻译原文,得到一组机器翻译评测集,以下的例子中就是一段原文通过百度、有道翻译之后,组织了一个机器翻译评测集
- 将译员翻译出来的译文,作为待评测数据,计算其与机器翻译评测集的BLEU值(使用NLTK中提供的BLEU评分方法)
- 值越高,表明匹配度越高,则译员参考机器翻译或者直接拷贝机器翻译的可能性就越高,此时需要项目经理介入。
以下是示例:
1、原文
新译星将代表四达时代集团在展览会上闪亮登场,届时我们将从新译星所开展的业务、具备的优势、成功案例等多个维度进行介绍,让您更加全面的了解新译星。我们拥有稳定的全职国际化团队,能够确保守时、高效的完成翻译和配音,并通过至臻完善的质量控制和项目管理体系进行全方位把控,提供翻译、配音、字幕制作、后期制作、播出以及收视率调查等一条龙服务。
2、人工翻译
New Transtar will present itself at the Exhibition on behalf of StarTimes, and we will give a comprehensive introduction of ourselves, including the current services we offer, the advantages we hold, and the projects we have completed, to help you understand us more. New Transtar boasts of an international team of professionals and is capable of providing fast and quality-guaranteed services including translating, dubbing, subtitle making, post-production, broadcasting and collecting of viewership ratings, thanks to our strict, streamlined and developed quality control and project management system.
3、百度翻译
The new translator will stand on the exhibition on behalf of the four times group at the exhibition. We will introduce the new star's business, the advantages and the successful cases, so that you can understand the new translator more comprehensively. We have a stable full-time international team that ensures punctual, efficient translation and dubbing, and provides a full range of control through the perfect quality control and project management system, providing a one-stop service for translation, dubbing, subtitle production, post production, broadcasting, and ratings surveys.
4、有道翻译
The new translator star will represent sida times group in the exhibition, when we will introduce the new translator star's business, advantages, successful cases and other dimensions, so that you can have a more comprehensive understanding of the new translator star. We have a stable full-time international team, which can ensure timely and efficient translation and dubbing. Through perfect quality control and project management system, we provide translation, dubbing, subtitle production, post-production, broadcasting and rating survey.
5、用百度翻译和有道翻译组织机器翻译评测集
[['The', 'new', 'translator', 'will', 'stand', 'on', 'the', 'exhibition', 'on', 'behalf', 'of', 'the', 'four', 'times', 'group', 'at', 'the', 'exhibition', 'We', 'will', 'introduce', 'the', 'new', 'star`s', 'business', 'the', 'advantages', 'and', 'the', 'successful', 'cases', 'so', 'that', 'you', 'can', 'understand', 'the', 'new', 'translator', 'more', 'comprehensively', 'We', 'have', 'a', 'stable', 'full-time', 'international', 'team', 'that', 'ensures', 'punctual', 'efficient', 'translation', 'and', 'dubbing', 'and', 'provides', 'a', 'full', 'range', 'of', 'control', 'through', 'the', 'perfect', 'quality', 'control', 'and', 'project', 'management', 'system', 'providing', 'a', 'one-stop', 'service', 'for', 'translation', 'dubbing', 'subtitle', 'production', 'post', 'production', 'broadcasting', 'and', 'ratings', 'surveys'],['The', 'new', 'translator', 'star', 'will', 'represent', 'sida', 'times', 'group', 'in', 'the', 'exhibition', 'when', 'we', 'will', 'introduce', 'the', 'new', 'translator', 'star`s', 'business', 'advantages', 'successful', 'cases', 'and', 'other', 'dimensions', 'so', 'that', 'you', 'can', 'have', 'a', 'more', 'comprehensive', 'understanding', 'of', 'the', 'new', 'translator', 'star', 'We', 'have', 'a', 'stable', 'full-time', 'international', 'team', 'which', 'can', 'ensure', 'timely', 'and', 'efficient', 'translation', 'and', 'dubbing', 'Through', 'perfect', 'quality', 'control', 'and', 'project', 'management', 'system', 'we', 'provide', 'translation', 'dubbing', 'subtitle', 'production', 'post-production', 'broadcasting', 'and', 'rating', 'survey']]
6、用人工翻译组织待检测数据
['New', 'Transtar', 'will', 'present', 'itself', 'at', 'the', 'Exhibition', 'on', 'behalf', 'of', 'StarTimes', 'and', 'we', 'will', 'give', 'a', 'comprehensive', 'introduction', 'of', 'ourselves', 'including', 'the', 'current', 'services', 'we', 'offer', 'the', 'advantages', 'we', 'hold', 'and', 'the', 'projects', 'we', 'have', 'completed', 'to', 'help', 'you', 'understand', 'us', 'more', 'New', 'Transtar', 'boasts', 'of', 'an', 'international', 'team', 'of', 'professionals', 'and', 'is', 'capable', 'of', 'providing', 'fast', 'and', 'quality-guaranteed', 'services', 'including', 'translating', 'dubbing', 'subtitle', 'making', 'post-production', 'broadcasting', 'and', 'collecting', 'of', 'viewership', 'ratings', 'thanks', 'to', 'our', 'strict', 'streamlined', 'and', 'developed', 'quality', 'control', 'and', 'project', 'management', 'system']
7、首先测试人工翻译产出的译文与机器翻译评测集之间的BLEU值,得到结果为0.119115465241,如下
[root@host---- ~]# python
Python 2.7. (default, Apr , ::)
[GCC 4.8. (Red Hat 4.8.-)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.translate.bleu_score import sentence_bleu
>>>
>>> reference=[['The', 'new', 'translator', 'will', 'stand', 'on', 'the', 'exhibition', 'on', 'behalf', 'of', 'the', 'four', 'times', 'group', 'at', 'the', 'exhibition', 'We', 'will', 'introduce', 'the', 'new', 'star`s', 'business', 'the', 'advantages', 'and', 'the', 'successful', 'cases', 'so', 'that', 'you', 'can', 'understand', 'the', 'new', 'translator', 'more', 'comprehensively', 'We', 'have', 'a', 'stable', 'full-time', 'international', 'team', 'that', 'ensures', 'punctual', 'efficient', 'translation', 'and', 'dubbing', 'and', 'provides', 'a', 'full', 'range', 'of', 'control', 'through', 'the', 'perfect', 'quality', 'control', 'and', 'project', 'management', 'system', 'providing', 'a', 'one-stop', 'service', 'for', 'translation', 'dubbing', 'subtitle', 'production', 'post', 'production', 'broadcasting', 'and', 'ratings', 'surveys'],['The', 'new', 'translator', 'star', 'will', 'represent', 'sida', 'times', 'group', 'in', 'the', 'exhibition', 'when', 'we', 'will', 'introduce', 'the', 'new', 'translator', 'star`s', 'business', 'advantages', 'successful', 'cases', 'and', 'other', 'dimensions', 'so', 'that', 'you', 'can', 'have', 'a', 'more', 'comprehensive', 'understanding', 'of', 'the', 'new', 'translator', 'star', 'We', 'have', 'a', 'stable', 'full-time', 'international', 'team', 'which', 'can', 'ensure', 'timely', 'and', 'efficient', 'translation', 'and', 'dubbing', 'Through', 'perfect', 'quality', 'control', 'and', 'project', 'management', 'system', 'we', 'provide', 'translation', 'dubbing', 'subtitle', 'production', 'post-production', 'broadcasting', 'and', 'rating', 'survey']]
>>>
>>> candidate=['New', 'Transtar', 'will', 'present', 'itself', 'at', 'the', 'Exhibition', 'on', 'behalf', 'of', 'StarTimes', 'and', 'we', 'will', 'give', 'a', 'comprehensive', 'introduction', 'of', 'ourselves', 'including', 'the', 'current', 'services', 'we', 'offer', 'the', 'advantages', 'we', 'hold', 'and', 'the', 'projects', 'we', 'have', 'completed', 'to', 'help', 'you', 'understand', 'us', 'more', 'New', 'Transtar', 'boasts', 'of', 'an', 'international', 'team', 'of', 'professionals', 'and', 'is', 'capable', 'of', 'providing', 'fast', 'and', 'quality-guaranteed', 'services', 'including', 'translating', 'dubbing', 'subtitle', 'making', 'post-production', 'broadcasting', 'and', 'collecting', 'of', 'viewership', 'ratings', 'thanks', 'to', 'our', 'strict', 'streamlined', 'and', 'developed', 'quality', 'control', 'and', 'project', 'management', 'system']
>>>
>>> score = sentence_bleu(reference, candidate)
>>> print score
0.119115465241
>>>
8、其次我们稍微改动以下百度翻译出来的译文,并测试其与机器翻译评测集之间的BLEU值,得到结果0.875629670466,如下:
8.1稍微改动之后的百度翻译
New Transtar will stand on the exhibition on behalf of the four times group at the exhibition. We will introduce the new star's business, the advantages and the successful cases, so that you can understand the new translator more comprehensively. We have a stable full-time international team that ensures punctual, efficient translation and dubbing, and provides a full range of control through the perfect quality control and project management system, providing a one-stop service for translation, dubbing, subtitle production, streamlined and developed quality control and project management system.
8.2用改动之后的百度翻译作为待评测数据
['New', 'Transtar', 'will', 'stand', 'on', 'the', 'exhibition', 'on', 'behalf', 'of', 'the', 'four', 'times', 'group', 'at', 'the', 'exhibition', 'We', 'will', 'introduce', 'the', 'new', 'star`s', 'business', 'the', 'advantages', 'and', 'the', 'successful', 'cases', 'so', 'that', 'you', 'can', 'understand', 'the', 'new', 'translator', 'more', 'comprehensively', 'We', 'have', 'a', 'stable', 'full-time', 'international', 'team', 'that', 'ensures', 'punctual', 'efficient', 'translation', 'and', 'dubbing', 'and', 'provides', 'a', 'full', 'range', 'of', 'control', 'through', 'the', 'perfect', 'quality', 'control', 'and', 'project', 'management', 'system', 'providing', 'a', 'one-stop', 'service', 'for', 'translation', 'dubbing', 'subtitle', 'production', 'streamlined', 'and', 'developed', 'quality', 'control', 'and', 'project', 'management', 'system']
8.3BLEU计算
>>> candidate_baidu=['New', 'Transtar', 'will', 'stand', 'on', 'the', 'exhibition', 'on', 'behalf', 'of', 'the', 'four', 'times', 'group', 'at', 'the', 'exhibition', 'We', 'will', 'introduce', 'the', 'new', 'star`s', 'business', 'the', 'advantages', 'and', 'the', 'successful', 'cases', 'so', 'that', 'you', 'can', 'understand', 'the', 'new', 'translator', 'more', 'comprehensively', 'We', 'have', 'a', 'stable', 'full-time', 'international', 'team', 'that', 'ensures', 'punctual', 'efficient', 'translation', 'and', 'dubbing', 'and', 'provides', 'a', 'full', 'range', 'of', 'control', 'through', 'the', 'perfect', 'quality', 'control', 'and', 'project', 'management', 'system', 'providing', 'a', 'one-stop', 'service', 'for', 'translation', 'dubbing', 'subtitle', 'production', 'streamlined', 'and', 'developed', 'quality', 'control', 'and', 'project', 'management', 'system']
>>> score_baidu = sentence_bleu(reference, candidate_baidu)
>>> print score_baidu
0.875629670466
>>>
9、由上面示例可看到,当待评测译文非常接近(也就是说该译员参考了机器翻译或直接进行的拷贝)机器翻译评测集中的数据时,BLEU值会升高。不过至于高到什么程度才需要项目经理介入,这就需要在实际项目中不断的摸索了。
利用BLEU进行机器翻译检测(Python-NLTK-BLEU评分方法)的更多相关文章
- 机器翻译质量评测算法-BLEU
机器翻译领域常使用BLEU对翻译质量进行测试评测.我们可以先看wiki上对BLEU的定义. BLEU (Bilingual Evaluation Understudy) is an algorithm ...
- 【NLP】Python NLTK处理原始文本
Python NLTK 处理原始文本 作者:白宁超 2016年11月8日22:45:44 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集的大量公开 ...
- 【NLP】Python NLTK获取文本语料和词汇资源
Python NLTK 获取文本语料和词汇资源 作者:白宁超 2016年11月7日13:15:24 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集 ...
- 【NLP】干货!Python NLTK结合stanford NLP工具包进行文本处理
干货!详述Python NLTK下如何使用stanford NLP工具包 作者:白宁超 2016年11月6日19:28:43 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的 ...
- 【NLP】Python NLTK 走进大秦帝国
Python NLTK 走进大秦帝国 作者:白宁超 2016年10月17日18:54:10 摘要:NLTK是由宾夕法尼亚大学计算机和信息科学使用python语言实现的一种自然语言工具包,其收集的大量公 ...
- Python NLTK 自然语言处理入门与例程(转)
转 https://blog.csdn.net/hzp666/article/details/79373720 Python NLTK 自然语言处理入门与例程 在这篇文章中,我们将基于 Pyt ...
- [转]【NLP】干货!Python NLTK结合stanford NLP工具包进行文本处理 阅读目录
[NLP]干货!Python NLTK结合stanford NLP工具包进行文本处理 原贴: https://www.cnblogs.com/baiboy/p/nltk1.html 阅读目录 目 ...
- Python+NLTK自然语言处理学习(一):环境搭建
Python+NLTK自然语言处理学习(一):环境搭建 参考黄聪的博客地址:http://www.cnblogs.com/huangcong/archive/2011/08/29/2157437.ht ...
- 利用百度智能云结合Python体验图像识别(转载来自qylruirui)
https://blog.csdn.net/qylruirui/article/details/94992917 利用百度智能云结合Python体验图像识别只要注册了百度账号就可以轻松体验百度智能云中 ...
随机推荐
- HDU 2008 数值统计
题目链接:HDU 2008 Description 统计给定的n个数中,负数.零和正数的个数. Input 输入数据有多组,每组占一行,每行的第一个数是整数n(n<100),表示需要统计的数值的 ...
- scrapy流程
- 九、JSP入门(2)
day12 JSP指令 1 JSP指令概述 JSP指令的格式:<%@指令名 attr1=”” attr2=”” %>,一般都会把JSP指令放到JSP文件的最上方,但这不是必须的. JSP中 ...
- 数据结构-堆 Java实现
数据结构-堆 Java实现. 实现堆自动增长 /** * 数据结构-堆. 自动增长 * */ public class Heap<T extends Comparable> { priva ...
- 白盒测试实践-day01
一.任务进展情况 了解本次测试的主要任务,进行小组分工. 二.存在的问题 由于最近比较忙,任务来不及做. 三.解决方法 坚持一下.
- NoSuchMethodError 问题
最近maven升级到gradle后,总是报NoSuchMethod error.然后 ,报错的类确实是有这个方法,一切看起来都没有问题.那么运行时jvm到底加载的哪里的类呢?有没有相关的命令可以查询, ...
- linux挂载系统ios文件与gcc安装
挂载方法: 1.将iso文件拷贝到某一目录下,(/test) 2.建立挂载点文件夹:mkdir /mnt/iso1 3.进入 mount –o loop /test/**.iso /mnt/is ...
- Zookeeper运维问题集锦
实际工作中用到Zookeeper集群的地方很多, 也碰到过各种各样的问题, 在这里作个收集整理, 后续会一直补充; 其中很多问题的原因, 解决方案都是google而来, 这里只是作次搬运工; 其实很多 ...
- nodejs 使用superagent+cheerio+eventproxy爬取豆瓣帖子
//cnpm install superagent cheerio eventproxy fs pathvar superagent = require('superagent'); var chee ...
- 原生js实现table表格列宽自由缩放
<!DOCTYPE html> <html> <head> <meta charset="gbk"> <title>ta ...