[文学阅读] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee Alon Lavie

Language Technologies Institute

Carnegie Mellon University

Pittsburgh, PA 15213

banerjee+@cs.cmu.edu alavie@cs.cmu.edu

Important Snippets:

1. In order to be both effective and useful, an automatic metric for MT evaluation has to satisfy several basic criteria. The primary and most intuitive requirement is that the metric have very high correlation with
quantified human notions of MT quality. Furthermore, a good metric should be as sensitive as possible to differences in MT quality between different systems, and between different versions of the same system. The metric should be

consistent (same MT system on similar texts should produce similar scores), reliable (MT systems that score similarly can be trusted to perform similarly) and general (applicable to different MT tasks in a wide range of domains and scenarios). Needless
to say, satisfying all of the above criteria is extremely difficult, and all of the metrics that have been proposed so far fall short of adequately addressing most if not all of these requirements.

2. It is based on an explicit word-to-word matching between the MT output being evaluated and one or more reference translations. Our current matching supports not only matching between words that are identical
in the two strings being compared, but can also match words that are simple morphological variants of each other

3. Each possible matching is scored based on a combination of several features. These currently include uni-gram-precision, uni-gram-recall, and a direct measure of how out-of-order the words of the MT output are with respect to
the reference.

4.Furthermore, our results demonstrated that recall plays a more important role than precision in obtaining high-levels of correlation with human judgments.

5.BLEU does not take recall into account directly.

6.BLEU does not use recall because the notion of recall is unclear when matching simultaneously against a set of reference translations (rather than a single reference). To compensate for recall, BLEU uses a Brevity
Penalty, which penalizes translations for being “too short”.

7.BLEU and NIST suffer from several weaknesses:

>The Lack of Recall

>Use of Higher Order N-grams

>Lack of Explicit Word-matching Between Translation and Reference

>Use of Geometric Averaging of N-grams

8.METEOR was designed to explicitly address the weaknesses in BLEU identified above. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference
translation. If more than one reference translation is available, the given translation is scored against each reference independently, and the best score is reported.

9.Given a pair of translations to be compared (a system translation and a reference translation), METEOR creates an alignment between the two strings. We define an alignment as a mapping be-tween unigrams, such that
every unigram in each string maps to zero or one unigram in the other string, and to no unigrams in the same string.

10.This alignment is incrementally produced through a series of stages, each stage consisting of two distinct phases.

11.In the first phase an external module lists all the possible unigram mappings between the two strings.

12.Different modules map unigrams based on different criteria. The “exact” module maps two unigrams if they are exactly the same (e.g. “computers” maps to “computers” but not “computer”). The “porter stem”
module maps two unigrams if they are the same after they are stemmed using the Porter stemmer (e.g.: “com-puters” maps to both “computers” and to “com-puter”). The “WN synonymy” module maps two unigrams if they are synonyms of each
other.

13.In the second phase of each stage, the largest subset of these unigram mappings is selected such

that the resulting set constitutes an alignment as defined above

14. METEOR selects that set that has the least number of unigram mapping crosses.

15.By default the first stage uses the “exact” mapping module, the second the “porter stem” module and the third the “WN synonymy” module.

16. unigram precision (P)

unigram recall (R)

Fmean by combining the precision and recall via a harmonic-mean

watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvaWN0MjAxNA==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast" alt="">

To take into account longer matches, METEOR computes a penalty for a given alignment as follows.

chunks such that the uni-grams in each chunk are in adjacent positions in the system translation, and are also mapped to uni-grams that are in adjacent positions in the reference translation.

Conclusion: METEOR prefer recall to precision while BLEU is converse.Meanwhile, it incorporates many information.

[文学阅读] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments的更多相关文章

(zhuan) Recurrent Neural Network
Recurrent Neural Network 2016年07月01日 Deep learning Deep learning 字数:24235 this blog from: http:/ ...
Paper Reading - Learning to Evaluate Image Captioning ( CVPR 2018 ) ★
Link of the Paper: https://arxiv.org/abs/1806.06422 Innovations: The authors propose a novel learnin ...
《30天学习30种新技术》-Day 15：Meteor —— 从零开始创建一个 Web 应用
目录:https://segmentfault.com/a/1190000000349384 原文: https://segmentfault.com/a/1190000000361440 到目前为止 ...
读书笔记——莫提默·J.艾德勒&查尔斯·范多伦（美）《如何阅读一本书》
第一篇阅读的层次第一章阅读的活力与艺术阅读的目标:娱乐.获得资讯.增进理解力这本书是为那些想把读书的主要目的当作是增进理解能力的人而写.何谓阅读艺术?这是一个凭借着头脑运作,除了玩味读物中的一 ...
如何阅读一本书——分析阅读Pre
如何阅读一本书--分析阅读Pre 前情介绍作者: 莫提默.艾德勒查尔斯.范多伦初版:1940年,一出版就是全美畅销书榜首一年多.钢铁侠Elon.Musk学过. 需要注意的句子: 成功的阅读牵涉到 ...
BLEU (Bilingual Evaluation Understudy)
什么是BLEU? BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text w ...
机器翻译质量评测算法-BLEU
机器翻译领域常使用BLEU对翻译质量进行测试评测.我们可以先看wiki上对BLEU的定义. BLEU (Bilingual Evaluation Understudy) is an algorithm ...
cvpr2015papers
@http://www-cs-faculty.stanford.edu/people/karpathy/cvpr2015papers/ CVPR 2015 papers (in nicer forma ...
{ICIP2014}{收录论文列表}
This article come from HEREARS-L1: Learning Tuesday 10:30–12:30; Oral Session; Room: Leonard de Vinc ...

随机推荐

【IOS实例小计】图像移动--可扩展为动态实现图标变化
预备知识: 1.页面切换: 从一个ViewController切换到另一个ViewController有下面几种方法: (1)self.view addSubview:(加载的新页面); 相 ...
[置顶] EasyMock构建单元测试
1. 背景单元测试作为程序的基本保障.很多时候构建测试场景是一件令人头疼的事.因为之前的单元测试都是内部代码引用的,环境自给自足.开发到了一定程度,你不得不到开始调用外部的接口来完成你的功能.而外部 ...
空间参考系统与WKT解析
空间参考系统与WKT解析 1.为什么要空间参考系统? 空间参考系统,也称为坐标系统.在GIS中为地理数据定位的基准,假设给你一个坐标(442281.875,4422651.589).如果不给你空间参考 ...
spring-framework-3.2.4.RELEASE 综合hibernate-release-4.3.5.Final一个错误Caused by: java.lang.NoClassDefFound
LZ一体化的今天spring-framework-3.2.4.RELEASE 综合hibernate-release-4.3.5.Final一个错误Caused by: java.lang.NoCla ...
8张图理解Java(转)
一图胜千言,下面图解均来自Program Creek 网站的Java教程,目前它们拥有最多的票选.如果图解没有阐明问题,那么你可以借助它的标题来一窥究竟. 1.字符串不变性下面这张图展示了这段代码做 ...
当Scheduler拿不到url的时候，不能立即退出
在webmagic的多线程抓取中有一个比较麻烦的问题:当Scheduler拿不到url的时候,不能立即退出,需要等到没抓完的线程都运行完毕,没有新url产生时,才能退出.之前使用Thread.sle ...
摘要算法CRC8、CRC16、CRC32，MD2 、MD4、MD5，SHA1、SHA256、SHA384、SHA512，RIPEMD、PANAMA、TIGER、ADLER32
1.CRC8.CRC16.CRC32 CRC(Cyclic Redundancy Check,循环冗余校验)算法出现时间较长,应用也十分广泛,尤其是通讯领域,现在应用最多的就是 CRC32 算法,它产 ...
ios应用接入微信开放平台
前几天试了一下服务端接入微信公众平台,昨天又看了一下APP接入开放平台开放平台和公众平台的差别公众平台针对的是公众账号,除了提供管理后台之外.也开放了若干接口,让微信server和开发人员自己的应 ...
不起眼的 z-index 却能牵扯出这么大的学问（转）
z-index在日常开发中算是一个比较常用的样式,一般理解就是设置标签在z轴先后顺序,z-index值大的显示在最前面,小的则会被遮挡,是的,z-index的实际作用就是这样. 但是你真的了解z-in ...
初始化openwrt的rootpassword
更改openwrt源代码 shadow 文件 package/base-files/files/etc/shadow shadow 文件參考http://blog.csdn.net/u01164188 ...

[文学阅读] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

[文学阅读] METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments的更多相关文章

随机推荐

热门专题