METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments

Satanjeev Banerjee   Alon Lavie 

Language Technologies Institute  

Carnegie Mellon University  

Pittsburgh, PA 15213  

banerjee+@cs.cmu.edu  alavie@cs.cmu.edu

Important Snippets:

1. In order to be both effective and useful, an automatic metric for MT evaluation has to satisfy several basic criteria. The primary and most intuitive requirement is that the metric have very high correlation with quantified human notions of MT quality. Furthermore, a good metric should be as sensitive as possible to differences in MT quality between different systems, and between different versions of the same system. The metric should be consistent (same MT system on similar texts should produce similar scores), reliable (MT systems that score similarly can be trusted to perform similarly) and general (applicable to different MT tasks in a wide range of domains and scenarios). Needless to say, satisfying all of the above criteria is extremely difficult, and all of the metrics that have been proposed so far fall short of adequately addressing most if not all of these requirements.

2. METEOR is based on an explicit word-to-word matching between the MT output being evaluated and one or more reference translations. The current matching supports not only matching between words that are identical in the two strings being compared, but can also match words that are simple morphological variants of each other.

3. Each possible matching is scored based on a combination of several features. These currently include unigram precision, unigram recall, and a direct measure of how out-of-order the words of the MT output are with respect to the reference.

4. Furthermore, our results demonstrated that recall plays a more important role than precision in obtaining high levels of correlation with human judgments.

5. BLEU does not take recall into account directly.

6. BLEU does not use recall because the notion of recall is unclear when matching simultaneously against a set of reference translations (rather than a single reference). To compensate for recall, BLEU uses a Brevity Penalty, which penalizes translations for being “too short”.
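For comparison, the brevity penalty from the standard BLEU definition can be sketched as below. The function and parameter names are illustrative and not taken from any particular toolkit. `candidate_len` is the length of the system output and `reference_len` is the effective reference length.

```python
import math


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU-style brevity penalty.

    Returns 1.0 when the candidate is longer than the reference,
    otherwise exp(1 - r/c), which shrinks as the candidate gets shorter.
    """
    if candidate_len == 0:
        # An empty candidate gets the maximum penalty (illustrative choice).
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```

The penalty only looks at lengths, so it cannot tell which reference words are missing. That is the gap an explicit recall term is meant to close.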

7. BLEU and NIST suffer from several weaknesses:

> The lack of recall

> Use of higher-order n-grams

> Lack of explicit word matching between translation and reference

> Use of geometric averaging of n-grams

8. METEOR was designed to explicitly address the weaknesses in BLEU identified above. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference translation. If more than one reference translation is available, the given translation is scored against each reference independently, and the best score is reported.
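The multi-reference rule is easy to sketch in isolation. In the snippet below, `best_reference_score` and `toy_overlap` are hypothetical names, and `toy_overlap` is only a placeholder scorer, not METEOR itself. Any function that scores one hypothesis against one reference could be plugged in.

```python
from typing import Callable, Sequence


def best_reference_score(
    hypothesis: str,
    references: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Score the hypothesis against each reference independently; keep the best."""
    if not references:
        raise ValueError("at least one reference is required")
    return max(score_fn(hypothesis, ref) for ref in references)


def toy_overlap(hyp: str, ref: str) -> float:
    """Placeholder scorer (NOT METEOR): fraction of reference words found in hyp."""
    hyp_words = set(hyp.split())
    ref_words = ref.split()
    if not ref_words:
        return 0.0
    return sum(w in hyp_words for w in ref_words) / len(ref_words)
```

Because each reference is scored on its own, recall stays well defined, which is exactly the difficulty noted for BLEU in snippet 6.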

9. Given a pair of translations to be compared (a system translation and a reference translation), METEOR creates an alignment between the two strings. We define an alignment as a mapping between unigrams, such that every unigram in each string maps to zero or one unigram in the other string, and to no unigrams in the same string.
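A small check makes the alignment definition concrete. This sketch assumes a mapping is stored as `(system_index, reference_index)` pairs. That representation is my own choice, and it already rules out mappings within the same string.

```python
from typing import Iterable, Tuple


def is_alignment(mappings: Iterable[Tuple[int, int]]) -> bool:
    """Return True if every unigram position is used at most once on each side."""
    seen_sys, seen_ref = set(), set()
    for sys_idx, ref_idx in mappings:
        if sys_idx in seen_sys or ref_idx in seen_ref:
            return False  # some unigram would map to two unigrams
        seen_sys.add(sys_idx)
        seen_ref.add(ref_idx)
    return True
```

For example, `[(0, 0), (1, 2)]` is an alignment, while `[(0, 0), (0, 1)]` is not, because system unigram 0 maps to two reference unigrams.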

10. This alignment is incrementally produced through a series of stages, each stage consisting of two distinct phases.

11. In the first phase an external module lists all the possible unigram mappings between the two strings.

12. Different modules map unigrams based on different criteria. The “exact” module maps two unigrams if they are exactly the same (e.g. “computers” maps to “computers” but not “computer”). The “porter stem” module maps two unigrams if they are the same after they are stemmed using the Porter stemmer (e.g. “computers” maps to both “computers” and “computer”). The “WN synonymy” module maps two unigrams if they are synonyms of each other.
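The first phase can be sketched as a function that lists every pair a module accepts. Note the assumptions: `toy_stem` is a crude stand-in for the Porter stemmer, and `TOY_SYNONYMS` is a hand-written table standing in for WordNet. Neither is the real resource, and all names here are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Mapping = Tuple[int, int]  # (position in system translation, position in reference)


def list_mappings(
    hyp: Sequence[str], ref: Sequence[str], match: Callable[[str, str], bool]
) -> List[Mapping]:
    """Phase one of a stage: list every unigram pair the module accepts."""
    return [(i, j) for i, h in enumerate(hyp) for j, r in enumerate(ref) if match(h, r)]


def exact_match(a: str, b: str) -> bool:
    return a == b


def toy_stem(word: str) -> str:
    # ASSUMPTION: stand-in for the Porter stemmer; strips a single trailing "s".
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def stem_match(a: str, b: str) -> bool:
    return toy_stem(a) == toy_stem(b)


# ASSUMPTION: tiny hand-written synonym table standing in for WordNet.
TOY_SYNONYMS = {frozenset({"quick", "fast"})}


def synonym_match(a: str, b: str) -> bool:
    return frozenset({a, b}) in TOY_SYNONYMS


hyp = ["computers"]
ref = ["computers", "computer"]
print(list_mappings(hyp, ref, exact_match))  # [(0, 0)]
print(list_mappings(hyp, ref, stem_match))   # [(0, 0), (0, 1)]
```

The output mirrors the example in the text: the exact module maps “computers” only to “computers”, while the stem module maps it to both words.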

13. In the second phase of each stage, the largest subset of these unigram mappings is selected such that the resulting set constitutes an alignment as defined above.

14. If more than one subset of the same largest size constitutes an alignment, METEOR selects the set that has the fewest unigram mapping crosses.
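Crosses can be counted directly from positions. The sketch below treats two mappings as crossing when their order in the system translation is the opposite of their order in the reference, so the product of the two position differences is negative. The `(system_index, reference_index)` pair representation is an assumption of this example.

```python
from itertools import combinations
from typing import Iterable, Tuple


def count_crosses(alignment: Iterable[Tuple[int, int]]) -> int:
    """Count pairs of mappings whose relative order differs between the two strings."""
    crosses = 0
    for (s1, r1), (s2, r2) in combinations(list(alignment), 2):
        # A negative product means the two mappings appear in opposite orders.
        if (s1 - s2) * (r1 - r2) < 0:
            crosses += 1
    return crosses
```

Given two candidate alignments of the same size, the one with the smaller `count_crosses` value keeps the word order closer to the reference and would be preferred.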

15. By default the first stage uses the “exact” mapping module, the second the “porter stem” module and the third the “WN synonymy” module.
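The effect of the stage order can be sketched with a simplified aligner. Two assumptions apply. First, it picks mappings greedily from left to right, rather than choosing the largest subset with the fewest crosses. Second, `toy_stem` is a crude stand-in for the Porter stemmer. Each stage only considers unigrams that earlier stages left unmapped.

```python
from typing import Callable, List, Sequence, Set, Tuple

Matcher = Callable[[str, str], bool]


def exact_match(a: str, b: str) -> bool:
    return a == b


def toy_stem(word: str) -> str:
    # ASSUMPTION: stand-in for the Porter stemmer; strips a single trailing "s".
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def stem_match(a: str, b: str) -> bool:
    return toy_stem(a) == toy_stem(b)


def staged_alignment(
    hyp: Sequence[str], ref: Sequence[str], stages: Sequence[Matcher]
) -> List[Tuple[int, int]]:
    """Greedy sketch: run the stages in order, never remapping an aligned unigram."""
    aligned: List[Tuple[int, int]] = []
    used_hyp: Set[int] = set()
    used_ref: Set[int] = set()
    for match in stages:
        for i, h in enumerate(hyp):
            if i in used_hyp:
                continue
            for j, r in enumerate(ref):
                if j not in used_ref and match(h, r):
                    aligned.append((i, j))
                    used_hyp.add(i)
                    used_ref.add(j)
                    break
    return sorted(aligned)


hyp = ["computers", "help", "computer"]
ref = ["computer", "helps", "computers"]
print(staged_alignment(hyp, ref, [exact_match, stem_match]))  # [(0, 2), (1, 1), (2, 0)]
print(staged_alignment(hyp, ref, [stem_match, exact_match]))  # [(0, 0), (1, 1), (2, 2)]
```

Running the exact stage first means identical surface forms are paired before any stemmed matches are considered, so the order of the stages acts as a priority order.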

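16. Once the final alignment is produced, METEOR computes unigram precision (P), the fraction of system-translation unigrams that are mapped; unigram recall (R), the fraction of reference unigrams that are mapped; and Fmean, which combines precision and recall via a harmonic mean.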
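In the original paper the harmonic mean puts nine times as much weight on recall as on precision: Fmean = 10PR / (R + 9P). A minimal sketch follows. The function name and the choice of the matched-unigram count as input are my own.

```python
from typing import Tuple


def precision_recall_fmean(
    matches: int, hyp_len: int, ref_len: int
) -> Tuple[float, float, float]:
    """Return (P, R, Fmean) given the number of mapped unigrams and both lengths."""
    if matches == 0 or hyp_len == 0 or ref_len == 0:
        return 0.0, 0.0, 0.0
    p = matches / hyp_len  # mapped unigrams / unigrams in the system translation
    r = matches / ref_len  # mapped unigrams / unigrams in the reference
    fmean = 10 * p * r / (r + 9 * p)  # recall weighted 9:1 over precision
    return p, r, fmean
```

With 3 matches, a 6-word system translation and a 4-word reference, P = 0.5 and R = 0.75, so Fmean = 3.75 / 5.25, about 0.714. That is much closer to R than to P.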


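To take longer matches into account, METEOR computes a penalty for a given alignment as follows. First, all the unigrams in the system translation that are mapped to unigrams in the reference translation are grouped into the fewest possible number of chunks such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation.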
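The chunk count and the resulting penalty can be sketched as follows. The formulas are the ones given in the original paper: Penalty = 0.5 × (chunks / matched unigrams)^3 and Score = Fmean × (1 − Penalty). The function names and the `(system_index, reference_index)` pair representation are assumptions of this example.

```python
from typing import Iterable, Tuple


def count_chunks(alignment: Iterable[Tuple[int, int]]) -> int:
    """Count maximal runs of mappings that are adjacent in both strings."""
    chunks = 0
    prev = None
    for sys_idx, ref_idx in sorted(alignment):
        # A new chunk starts whenever adjacency breaks on either side.
        if prev is None or sys_idx != prev[0] + 1 or ref_idx != prev[1] + 1:
            chunks += 1
        prev = (sys_idx, ref_idx)
    return chunks


def fragmentation_penalty(alignment: Iterable[Tuple[int, int]]) -> float:
    """Penalty = 0.5 * (chunks / matched unigrams) ** 3."""
    pairs = list(alignment)
    if not pairs:
        return 0.0
    return 0.5 * (count_chunks(pairs) / len(pairs)) ** 3


def meteor_score(fmean: float, alignment: Iterable[Tuple[int, int]]) -> float:
    """Score = Fmean * (1 - Penalty)."""
    return fmean * (1.0 - fragmentation_penalty(alignment))
```

For “the president spoke to the audience” against “the president then spoke to the audience”, the alignment `[(0, 0), (1, 1), (2, 3), (3, 4), (4, 5), (5, 6)]` has two chunks. When no two matched unigrams are adjacent, the number of chunks equals the number of matches and the penalty reaches its maximum of 0.5.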

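Conclusion: METEOR weights recall more heavily than precision, whereas BLEU is precision-based and does not take recall into account directly. METEOR also incorporates more information: stemming, synonymy, and a word-order penalty.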

