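BLEU stands for Bilingual Evaluation Understudy. It was proposed in 2002 as a method for evaluating machine translation output. The method is simple, fast, and easy to understand. Because it works reasonably well, it has been widely carried over to many other evaluation tasks in natural language processing. Its popularity is something of a case of "in the land of the blind, the one-eyed man is king": it became the default largely because nothing better was at hand.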

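Problem description

First, let us build an intuition for the BLEU algorithm.

There are two kinds of problems:

1. Given one sentence and a set of reference sentences, compute the BLEU score. This problem is called sentence_bleu.

2. Given a batch of sentences, each with its own set of reference sentences, compute the BLEU score. This problem is called corpus_bleu.

The sentence produced by machine translation is called the candidate, and the set of reference sentences is called the references.

The score is computed from the overlap between the candidate and the references: the more they have in common, the better the translation is judged to be.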

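Computing BLEU for one sentence against a set of references

BLEU considers four n-gram orders, n = 1, 2, 3, 4, and each order can be given its own weight.

For each n-gram order:

Split the candidate and each reference into n-grams.
Count how often each n-gram occurs in the candidate and in each reference.
For each n-gram in the candidate, clip its count so that it does not exceed the largest count of that n-gram in any single reference.

This step deals with candidates such as "the the the the the": "the" occurs in the references, so without clipping every word would count as a match and the score would be 1. The references are used to constrain such degenerate candidates.
Sum the clipped counts of the candidate's n-grams and divide by the total number of n-grams in the candidate. The result is the score for that order.
Multiply the combined score by the brevity penalty to obtain the final BLEU score.

This step deals with short sentences. For example, if the candidate consists of the single word "the" and "the" occurs in the references, the score would be 1. In other words, the candidate stays silent for fear of saying something wrong.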
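As a small illustration of the clipping step, here is a minimal sketch of the clipped n-gram score for a single order. The helper names `ngrams` and `clipped_precision` and the example sentences are assumptions made for this illustration, and tokens are obtained with a plain `split()`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(references, candidate, n):
    """Clipped n-gram precision of `candidate` against several references.

    Hypothetical helper for illustration: each candidate n-gram count is
    capped at the largest count of that n-gram in any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    clipped = sum(
        min(count, max(rc[gram] for rc in ref_counts))
        for gram, count in cand_counts.items()
    )
    return clipped / total


refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
degenerate = "the the the the the the the".split()

# Without clipping all 7 tokens would match; with clipping only 2 count,
# because "the" occurs at most twice in a single reference.
print(clipped_precision(refs, degenerate, 1))  # 2/7, roughly 0.286
```

The cap is taken per reference rather than over the references combined, so a word cannot collect credit from several references at once.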

BLEU was not built in one go. Many people kept finding loopholes in it and proposing fixes. Its history is a useful lesson in how to design rules that deal with bad cases.

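Finally, the scores of the different n-gram orders (1-gram, 2-gram, 3-gram, and so on) are combined with a weighted geometric mean, s1^w1 * s2^w2 * s3^w3, not with a weighted arithmetic mean, w1*s1 + w2*s2 + w3*s3.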
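To see how the two ways of combining scores differ, the sketch below computes both for the same inputs. The per-order scores are made-up values and the helper names are assumptions for illustration only.

```python
import math


def geometric_mean(scores, weights):
    """Weighted geometric mean: s1^w1 * s2^w2 * ... (0 if any score is 0)."""
    if any(s == 0 for s in scores):
        return 0.0
    return math.exp(sum(w * math.log(s) for s, w in zip(scores, weights)))


def arithmetic_mean(scores, weights):
    """Weighted arithmetic mean: w1*s1 + w2*s2 + ..."""
    return sum(w * s for s, w in zip(scores, weights))


weights = [0.25, 0.25, 0.25, 0.25]
scores = [0.8, 0.4, 0.2, 0.1]    # hypothetical 1- to 4-gram scores
no_4gram = [0.8, 0.4, 0.2, 0.0]  # same, but no 4-gram matches at all

print(geometric_mean(scores, weights))     # about 0.283
print(arithmetic_mean(scores, weights))    # about 0.375
print(geometric_mean(no_4gram, weights))   # 0.0
print(arithmetic_mean(no_4gram, weights))  # about 0.35
```

With the geometric mean, a single order with no matches drives the whole score to zero, while the arithmetic mean barely moves.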

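from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references, candidate):
    # Brevity penalty: compare the candidate length c with the length r
    # of the reference whose length is closest to it.
    ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
    ref_len = len(references[ind])
    if ref_len <= len(candidate):
        return 1  # a candidate at least as long as the reference is not penalised
    # BP = exp(1 - r / c), which is below 1 when the candidate is shorter
    return np.e ** (1 - ref_len / len(candidate))


def parse_ngram(sentence, gram):
    # Split a sequence into n-grams. Note the +1, otherwise the last n-gram is lost.
    # Each n-gram is a tuple so that it is hashable for both strings and token lists.
    return [tuple(sentence[i:i + gram]) for i in range(len(sentence) - gram + 1)]


def sentence_bleu(references, candidate, weight):
    bp_value = bp(references, candidate)
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        # split into n-grams
        ref = [parse_ngram(i, gram) for i in references]
        can = parse_ngram(candidate, gram)
        # count n-gram occurrences
        ref_counter = [Counter(i) for i in ref]
        can_counter = Counter(can)
        # clip each candidate n-gram count by its largest count in any single reference
        appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
        score = appear / len(can)
        # each n-gram order has its own weight
        s *= score ** wei
    s *= bp_value  # the final score is multiplied by the brevity penalty
    return s


# The sentences are plain strings, so lengths and n-grams are character-based here.
# Pass sentence.split() instead to score at the word level.
references = [
    "the dog jumps high",
    "the cat runs fast",
    "dog and cats are good friends"
]
candidate = "the d o g  jump s hig"
weights = [0.25, 0.25, 0.25, 0.25]

print(sentence_bleu(references, candidate, weights))
# NLTK's result, for comparison
print(bleu_score.sentence_bleu(references, candidate, weights))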

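A corpus consists of multiple sentences. corpus_bleu is not the mean of the per-sentence sentence_bleu scores; it uses a slightly more involved calculation, which can feel somewhat arbitrary.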

corpus_bleu

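Suppose a document contains 3 sentences whose n-gram scores are a1/b1, a2/b2, and a3/b3.

The score over all sentences is then (a1+a2+a3)/(b1+b2+b3).

The brevity penalty is pooled in the same way: if the three candidates have lengths l1, l2, l3 and their closest references have lengths k1, k2, k3, the penalty is computed from the total candidate length l1+l2+l3 and the total reference length k1+k2+k3.

In other words, corpus_bleu does not simply average sentence_bleu; it pools the counts across sentences and applies one unified calculation.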
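The difference between pooling and averaging can be sketched with made-up counts. The numbers, the lengths, and the helper `brevity_penalty` are assumptions for illustration; the penalty uses the standard form exp(1 - r/c) for a candidate of total length c that is shorter than the total reference length r.

```python
import math


def brevity_penalty(reference_len, candidate_len):
    """exp(1 - r/c) when the candidate is shorter than the reference, else 1."""
    if candidate_len >= reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1 - reference_len / candidate_len)


# Hypothetical (matched, total) n-gram counts for three sentences: a_i / b_i
counts = [(3, 4), (1, 10), (5, 6)]

pooled = sum(a for a, _ in counts) / sum(b for _, b in counts)
averaged = sum(a / b for a, b in counts) / len(counts)
print(pooled)    # (3+1+5) / (4+10+6) = 0.45
print(averaged)  # mean of 0.75, 0.1 and 0.833..., about 0.561

# Hypothetical candidate lengths l_i and closest-reference lengths k_i
cand_lens = [4, 10, 6]
ref_lens = [5, 12, 6]
print(brevity_penalty(sum(ref_lens), sum(cand_lens)))  # exp(1 - 23/20), about 0.861
```

Pooling weights each sentence by its number of n-grams, so the long, poorly matched second sentence pulls the pooled score well below the plain average.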

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # Brevity penalty: exp(1 - r / c) when the candidate is shorter than the reference
    if references_len <= candidate_len:
        return 1
    return np.e ** (1 - references_len / candidate_len)


def parse_ngram(sentence, gram):
    # n-grams as tuples, so both strings and token lists can be counted
    return [tuple(sentence[i:i + gram]) for i in range(len(sentence) - gram + 1)]


def corpus_bleu(references_list, candidate_list, weights):
    # pooled lengths: total candidate length and total closest-reference length
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = 0
    for candidate, references in zip(candidate_list, references_list):
        ind = np.argmin([abs(len(i) - len(candidate)) for i in references])
        reference_len += len(references[ind])
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator: clipped matches summed over all sentences
        down = 0  # denominator: candidate n-grams summed over all sentences
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            # split into n-grams
            ref = [parse_ngram(i, gram) for i in references]
            can = parse_ngram(candidate, gram)
            # count n-gram occurrences
            ref_counter = [Counter(i) for i in ref]
            can_counter = Counter(can)
            # clip each candidate n-gram count by its largest count in any single reference
            appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
            up += appear
            down += len(can)
        s *= (up / down) ** wei
    return bp(reference_len, candidate_len) * s


# One list of references per candidate; strings are scored at the character level.
references = [
    [
        "the dog jumps high",
        "the cat runs fast",
        "dog and cats are good friends"],
    [
        "ba ga ya",
        "lu ha a df",
    ]
]
candidate = ["the d o g  jump s hig", 'it is too bad']
weights = [0.25, 0.25, 0.25, 0.25]

print(corpus_bleu(references, candidate, weights))
# NLTK's result, for comparison
print(bleu_score.corpus_bleu(references, candidate, weights))

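If you use NLTK 3.2, released in March 2016, there is a bug in its corpus_bleu; NLTK fixed it in October 2016. NLTK's code uses Fraction to accumulate the per-sentence scores, and Fraction automatically reduces the numerator and denominator, which makes the summed result wrong.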
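The effect of that automatic reduction can be reproduced with the standard library alone. The counts below are made up, and this sketch only mimics the summation pattern described above; it is not NLTK's actual code.

```python
from fractions import Fraction

# Hypothetical (matched, total) n-gram counts for two sentences
counts = [(2, 4), (1, 3)]

# Correct pooling: add the raw numerators and the raw denominators
correct = Fraction(sum(a for a, _ in counts), sum(b for _, b in counts))

# Fraction(2, 4) is stored as 1/2, so the raw counts are lost
reduced = [Fraction(a, b) for a, b in counts]
wrong = Fraction(
    sum(f.numerator for f in reduced),
    sum(f.denominator for f in reduced),
)

print(correct)  # 3/7
print(wrong)    # 2/5
```

Keeping the raw integer counts in separate accumulators, as the code in this post does with `up` and `down`, avoids the problem.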

Simplified code

Many steps are shared between sentence_bleu and corpus_bleu and can be merged. The condensed code is as follows:

from collections import Counter

import numpy as np
from nltk.translate import bleu_score


def bp(references_len, candidate_len):
    # Brevity penalty: exp(1 - r / c) when the candidate is shorter than the reference, else 1
    return np.e ** (1 - references_len / candidate_len) if references_len > candidate_len else 1


def nearest_len(references, candidate):
    # length of the reference whose length is closest to the candidate's
    return len(references[np.argmin([abs(len(i) - len(candidate)) for i in references])])


def parse_ngram(sentence, gram):
    # n-grams as tuples, so both strings and token lists can be counted
    return [tuple(sentence[i:i + gram]) for i in range(len(sentence) - gram + 1)]


def appear_count(references, candidate, gram):
    ref = [parse_ngram(i, gram) for i in references]
    can = parse_ngram(candidate, gram)
    # count n-gram occurrences
    ref_counter = [Counter(i) for i in ref]
    can_counter = Counter(can)
    # clip each candidate n-gram count by its largest count in any single reference
    appear = sum(min(cnt, max(i.get(word, 0) for i in ref_counter)) for word, cnt in can_counter.items())
    return appear, len(can)


def corpus_bleu(references_list, candidate_list, weights):
    candidate_len = sum(len(i) for i in candidate_list)
    reference_len = sum(nearest_len(references, candidate) for candidate, references in zip(candidate_list, references_list))
    bp_value = bp(reference_len, candidate_len)
    s = 1
    for index, wei in enumerate(weights):
        up = 0  # numerator
        down = 0  # denominator
        gram = index + 1
        for candidate, references in zip(candidate_list, references_list):
            appear, total = appear_count(references, candidate, gram)
            up += appear
            down += total
        s *= (up / down) ** wei
    return bp_value * s


def sentence_bleu(references, candidate, weight):
    bp_value = bp(nearest_len(references, candidate), len(candidate))
    s = 1
    for gram, wei in enumerate(weight):
        gram = gram + 1
        appear, total = appear_count(references, candidate, gram)
        score = appear / total
        # each n-gram order has its own weight
        s *= score ** wei
    # the final score is multiplied by the brevity penalty
    return s * bp_value


if __name__ == '__main__':
    # One list of references per candidate; strings are scored at the character level.
    references = [
        [
            "the dog jumps high",
            "the cat runs fast",
            "dog and cats are good friends"],
        [
            "ba ga ya",
            "lu ha a df",
        ]
    ]
    candidate = ["the d o g  jump s hig", 'it is too bad']
    weights = [0.25, 0.25, 0.25, 0.25]

    print(corpus_bleu(references, candidate, weights))
    print(bleu_score.corpus_bleu(references, candidate, weights))
    print(sentence_bleu(references[0], candidate[0], weights))
    print(bleu_score.sentence_bleu(references[0], candidate[0], weights))

References

https://cloud.tencent.com/developer/article/1042161

https://en.wikipedia.org/wiki/BLEU

https://blog.csdn.net/qq_31584157/article/details/77709454

https://www.jianshu.com/p/15c22fadcba5

