Link of the Paper:


  • The authors propose a novel learning-based discriminative evaluation metric that is trained directly to distinguish between human and machine-generated captions. They train an automatic critique to tell generated captions apart from human-written ones, and then score candidate captions by how well they fool the critique. Formally, given a critique parametrized by $\Theta$, a reference image $i$, and a generated caption $c$, the score is the probability, as assigned by the critique, that the caption is human-written: $\mathrm{score}_\Theta(c, i) = P(c \text{ is human written} \mid i, \Theta)$. More generally, the reference image represents the context in which the generated caption is evaluated. To provide further information about the relevance and salience of the image content, a reference caption can additionally be supplied as part of the context. Let $C(i)$ denote the context of image $i$; a reference caption $\hat{c}$ can then be included in it, i.e. $\hat{c} \in C(i)$. The score with context becomes $\mathrm{score}_\Theta(c, i) = P(c \text{ is human written} \mid C(i), \Theta)$.
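
The score definition above can be illustrated with a toy critique. This is only a sketch under loud assumptions: the feature vectors, the element-wise interaction, the logistic output, and the names `critique_score` and `binary_cross_entropy` are all hypothetical stand-ins, not the authors' architecture. It only shows that the metric is a probability produced by a binary classifier with parameters Θ.

```python
import math
from typing import Sequence


def critique_score(caption_vec: Sequence[float],
                   context_vec: Sequence[float],
                   theta: Sequence[float],
                   bias: float = 0.0) -> float:
    """Toy P(caption is human written | context, theta).

    Assumption: caption and context are already encoded as equal-length
    vectors; a real critique would learn these encoders as well.
    """
    if not (len(caption_vec) == len(context_vec) == len(theta)):
        raise ValueError("all vectors must have the same length")
    # Element-wise product as a (very crude) caption-context interaction.
    interaction = [c * x for c, x in zip(caption_vec, context_vec)]
    logit = sum(w * z for w, z in zip(theta, interaction)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(p: float, is_human: bool) -> float:
    """A standard loss for a human-vs-machine classifier (assumed here)."""
    eps = 1e-12
    return -math.log(p + eps) if is_human else -math.log(1.0 - p + eps)


caption = [0.2, -0.5, 1.0]
context = [1.0, 0.3, -0.7]
untrained = critique_score(caption, context, theta=[0.0, 0.0, 0.0])
trained = critique_score(caption, context, theta=[2.0, -1.0, -3.0])
print(untrained, trained)
```

With all-zero parameters the toy critique is undecided (probability 0.5); the second parameter vector is hand-picked so that this caption looks human-written.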


  • To systematically create pathological sentences, the authors define several transformations that generate unnatural sentences which might nevertheless receive high scores from an evaluation metric. Their data augmentation scheme uses these transformations to generate a large number of negative examples. Formally, a transformation $\mathcal{T}$ takes an image-caption dataset and generates a new one: $\mathcal{T}(\{(c, i) \in D\}; \gamma) = \{(c'_1, i'_1), \ldots, (c'_n, i'_n)\}$, where $i, i'_k$ are images, $c, c'_k$ are captions, $D$ is a list of caption-image tuples representing the original dataset, and $\gamma$ is a hyper-parameter that controls the strength of the transformation. Specifically, the authors define the following three transformations to generate pathological image-caption pairs:

    • Random Captions ( RC ): To ensure the metric pays attention to the image content, they randomly sample human-written captions from other images in the training set: $\mathcal{T}_{RC}(D; \gamma) = \{(c', i) \mid (c, i), (c', i') \in D,\ i' \in N_\gamma(i)\}$, where $N_\gamma(i)$ represents the set of images that are the top $\gamma$ percent nearest neighbors of image $i$.
    • Word Permutation ( WP ): To make sure the metric pays attention to sentence structure, the authors randomly permute at least 2 words in the reference caption: $\mathcal{T}_{WP}(D; \gamma) = \{(c', i) \mid (c, i) \in D,\ c' \in P_\gamma(c) \setminus \{c\}\}$, where $P_\gamma(c)$ represents all sentences generated by permuting $\gamma$ percent of the words in caption $c$.
    • Random Word ( RW ): To explore rare words, the authors replace from 2 to all words of the reference caption with random words from the vocabulary: $\mathcal{T}_{RW}(D; \gamma) = \{(c', i) \mid (c, i) \in D,\ c' \in W_\gamma(c) \setminus \{c\}\}$, where $W_\gamma(c)$ represents all sentences generated by randomly replacing $\gamma$ percent of the words in caption $c$.
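
The three transformations can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the `neighbors` table (image ids ranked nearest first), the helper `num_affected`, the retry limit, and the toy data are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (caption, image_id)


def num_affected(n_words: int, gamma: float) -> int:
    """gamma percent of the words, but at least 2 (capped at the caption length)."""
    return min(n_words, max(2, round(n_words * gamma / 100)))


def random_captions(dataset: List[Pair], neighbors: Dict[str, List[str]],
                    gamma: float, rng: random.Random) -> List[Pair]:
    """RC: pair each image with a human caption of one of its nearest neighbors."""
    captions_of: Dict[str, List[str]] = {}
    for caption, image in dataset:
        captions_of.setdefault(image, []).append(caption)
    out = []
    for _, image in dataset:
        ranked = neighbors[image]  # assumed precomputed, nearest first
        top_k = max(1, round(len(ranked) * gamma / 100))
        other = rng.choice(ranked[:top_k])
        out.append((rng.choice(captions_of[other]), image))
    return out


def word_permutation(dataset: List[Pair], gamma: float, rng: random.Random,
                     max_tries: int = 20) -> List[Pair]:
    """WP: shuffle the words at randomly chosen positions; skip if nothing changes."""
    out = []
    for caption, image in dataset:
        words = caption.split()
        for _ in range(max_tries):
            idx = rng.sample(range(len(words)), num_affected(len(words), gamma))
            targets = idx[:]
            rng.shuffle(targets)
            new_words = words[:]
            for src, dst in zip(idx, targets):
                new_words[dst] = words[src]
            if new_words != words:  # enforce c' != c
                out.append((" ".join(new_words), image))
                break
    return out


def random_word(dataset: List[Pair], vocab: List[str], gamma: float,
                rng: random.Random, max_tries: int = 20) -> List[Pair]:
    """RW: replace words at randomly chosen positions with random vocabulary words."""
    out = []
    for caption, image in dataset:
        words = caption.split()
        for _ in range(max_tries):
            new_words = words[:]
            for pos in rng.sample(range(len(words)), num_affected(len(words), gamma)):
                new_words[pos] = rng.choice(vocab)
            if new_words != words:  # enforce c' != c
                out.append((" ".join(new_words), image))
                break
    return out


rng = random.Random(0)
data = [("a dog runs on the grass", "img1"),
        ("two people ride a red bus", "img2"),
        ("a cat sleeps on a sofa", "img3")]
neighbors = {"img1": ["img3", "img2"], "img2": ["img1", "img3"], "img3": ["img1", "img2"]}
vocab = ["tree", "blue", "jumps", "pizza", "under"]

rc = random_captions(data, neighbors, gamma=50, rng=rng)
wp = word_permutation(data, gamma=50, rng=rng)
rw = random_word(data, vocab, gamma=50, rng=rng)
```

Each function returns captions paired with the original image, so the outputs can be used directly as negative examples alongside the human-written positives.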

  • The authors propose a systematic approach to measure the robustness of an evaluation metric to a given pathological transformation.
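
One way to make such a robustness measurement concrete is sketched below. The definition used here is an assumption, not taken from the paper: robustness is the fraction of cases in which a metric scores a human caption strictly higher than its pathological version. The toy n-gram precision metrics and the word-reversal transformation (a deterministic special case of word permutation) are likewise illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[str, str], float]  # metric(candidate, reference) -> score


def ngrams(words: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(words[k:k + n]) for k in range(len(words) - n + 1)]


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Toy overlap metric: share of candidate n-grams that occur in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = set(ngrams(reference.split(), n))
    if not cand:
        return 0.0
    return sum(g in ref for g in cand) / len(cand)


def reverse_words(caption: str) -> str:
    """A fixed permutation of all words, standing in for a WP-style transformation."""
    return " ".join(reversed(caption.split()))


def robustness(metric: Metric, pairs: List[Tuple[str, str]],
               transform: Callable[[str], str]) -> float:
    """Fraction of pairs where the human caption beats its transformed version."""
    wins = 0
    for reference, human in pairs:
        if metric(human, reference) > metric(transform(human), reference):
            wins += 1
    return wins / len(pairs)


# (reference caption, second human caption for the same image)
pairs = [("a dog runs on the grass", "a dog runs across the grass"),
         ("two people ride a red bus", "two people are riding a bus")]

unigram_robustness = robustness(lambda c, r: ngram_precision(c, r, 1), pairs, reverse_words)
bigram_robustness = robustness(lambda c, r: ngram_precision(c, r, 2), pairs, reverse_words)
print(unigram_robustness, bigram_robustness)  # 0.0 1.0
```

The unigram metric cannot see word order at all, so it never prefers the human caption over the reversed one, while the bigram metric does on these two examples.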

General Points:

  • Commonly used evaluation metrics for Image Captioning: BLEU, METEOR, ROUGE, CIDEr, SPICE. These metrics face two challenges. Firstly, many metrics fail to correlate well with human judgments. Metrics based on measuring word overlap between candidate and reference captions struggle to capture the semantic meaning of a sentence, and therefore often correlate poorly with human judgments. Secondly, each evaluation metric has its well-known blind spot, and rule-based metrics are often too inflexible to respond to new pathological cases.
  • Compact Bilinear Pooling ( CBP ) has been demonstrated in "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding" to be very effective in combining heterogeneous image and text information.
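
For readers unfamiliar with CBP, the core idea can be sketched in pure Python: project each modality with a Count Sketch and combine the two sketches by circular convolution, which equals a Count Sketch of the full outer product. The function names, dimensions, and input vectors below are illustrative assumptions. Practical implementations compute the convolution with FFTs rather than the quadratic loop used here.

```python
import random
from typing import List, Tuple

Hash = Tuple[List[int], List[int]]  # (bucket index per input dim, sign per input dim)


def make_hash(in_dim: int, out_dim: int, rng: random.Random) -> Hash:
    """Random bucket and sign assignment, drawn once and then kept fixed."""
    buckets = [rng.randrange(out_dim) for _ in range(in_dim)]
    signs = [rng.choice((-1, 1)) for _ in range(in_dim)]
    return buckets, signs


def count_sketch(x: List[float], hash_fn: Hash, out_dim: int) -> List[float]:
    buckets, signs = hash_fn
    out = [0.0] * out_dim
    for value, bucket, sign in zip(x, buckets, signs):
        out[bucket] += sign * value
    return out


def circular_convolution(a: List[float], b: List[float]) -> List[float]:
    d = len(a)
    out = [0.0] * d
    for p, av in enumerate(a):
        for q, bv in enumerate(b):
            out[(p + q) % d] += av * bv
    return out


def compact_bilinear_pooling(img: List[float], txt: List[float],
                             hash_img: Hash, hash_txt: Hash, out_dim: int) -> List[float]:
    """Fuse two feature vectors into out_dim values without forming the outer product."""
    return circular_convolution(count_sketch(img, hash_img, out_dim),
                                count_sketch(txt, hash_txt, out_dim))


def sketch_outer_product(img: List[float], txt: List[float],
                         hash_img: Hash, hash_txt: Hash, out_dim: int) -> List[float]:
    """Reference computation: Count Sketch applied to the explicit outer product."""
    (h1, s1), (h2, s2) = hash_img, hash_txt
    out = [0.0] * out_dim
    for j, a in enumerate(img):
        for k, b in enumerate(txt):
            out[(h1[j] + h2[k]) % out_dim] += s1[j] * s2[k] * a * b
    return out


rng = random.Random(0)
out_dim = 8
img_feat = [1.0, 2.0, -1.0, 0.5]  # stand-in for image features
txt_feat = [0.5, -2.0, 3.0]       # stand-in for caption features
hash_img = make_hash(len(img_feat), out_dim, rng)
hash_txt = make_hash(len(txt_feat), out_dim, rng)

pooled = compact_bilinear_pooling(img_feat, txt_feat, hash_img, hash_txt, out_dim)
direct = sketch_outer_product(img_feat, txt_feat, hash_img, hash_txt, out_dim)
```

The two results agree up to floating-point error, which is why the pooled vector can stand in for the much larger bilinear (outer-product) feature.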

Paper Reading - Learning to Evaluate Image Captioning ( CVPR 2018 )
