Paper Reading - Learning to Evaluate Image Captioning ( CVPR 2018 ) ★

Link of the Paper: https://arxiv.org/abs/1806.06422

Innovations:

The authors propose a novel learning-based discriminative evaluation metric that is directly trained to distinguish between human and machine-generated captions. They train an automatic critique to distinguish generated captions from human-written ones, and then score candidate captions by how successfully they fool the critique. Formally, given a critique parametrized by Θ, a reference image i, and a generated caption c, the score is defined as the probability that the caption is human-written, as assigned by the critique: score_Θ(c, i) = P(c is human written | i, Θ). More generally, the reference image represents the context in which the generated caption is evaluated. To provide further information about the relevance and salience of the image content, a reference caption can additionally be supplied as part of the context. Let C(i) denote the context of image i; a reference caption c_ref can then be included in it, i.e. c_ref ∈ C(i). The score with context becomes score_Θ(c, i) = P(c is human written | C(i), Θ).
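The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Context` container, the `Critique` type, and `toy_critique` are hypothetical names, and the toy critique is a hand-written stand-in rather than the trained model from the paper. The only part taken from the text is that the score is the probability the critique assigns to "human written" given the context C(i).

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Context:
    """C(i): the image plus optional reference captions (hypothetical container)."""
    image_id: str
    reference_captions: Tuple[str, ...] = ()


# A critique maps (candidate caption, context) to a real-valued logit.
Critique = Callable[[str, Context], float]


def score(critique: Critique, caption: str, context: Context) -> float:
    """score_Theta(c, i) = P(c is human written | C(i), Theta), as a sigmoid of the logit."""
    logit = critique(caption, context)
    return 1.0 / (1.0 + math.exp(-logit))


def toy_critique(caption: str, context: Context) -> float:
    """Stand-in critique (NOT the paper's model): word overlap with references, shifted by 2."""
    words = set(caption.lower().split())
    ref_words = set()
    for ref in context.reference_captions:
        ref_words.update(ref.lower().split())
    return float(len(words & ref_words)) - 2.0


ctx = Context("img_001", ("a dog runs on the beach",))
good = score(toy_critique, "a dog runs along the beach", ctx)
bad = score(toy_critique, "two planes parked at an airport", ctx)
print(f"good={good:.3f} bad={bad:.3f}")
```

In a real setting `toy_critique` would be replaced by a trained binary classifier over image and caption features; the `score` wrapper would stay the same.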

To systematically create pathological sentences, the authors define several transformations that generate unnatural sentences which might nevertheless get high scores from an evaluation metric. Their proposed data augmentation scheme uses these transformations to generate a large number of negative examples. Formally, a transformation T takes an image-caption dataset and generates a new one: T({(c, i) ∈ D}; γ) = {(c₁', i₁'), ..., (c_n', i_n')}, where i and i_k' are images, c and c_k' are captions (k = 1, ..., n), D is a list of caption-image tuples representing the original dataset, and γ is a hyper-parameter that controls the strength of the transformation. Specifically, the authors define the following three transformations to generate pathological image-caption pairs:
- Random Captions ( RC ): To ensure the metric pays attention to the image content, they randomly sample human-written captions from other images in the training set: T_RC(D; γ) = {(c', i) | (c, i), (c', i') ∈ D, i'∈N_γ(i)}, where N_γ(i) represents the set of images that are the top γ percent nearest neighbors of image i.
- Word Permutation ( WP ): To make sure that the metric pays attention to sentence structure, the authors randomly permute at least 2 words in the reference caption: T_WP(D; γ) = {(c', i) | (c, i) ∈ D, c' ∈ P_γ(c) \ {c}}, where P_γ(c) represents all sentences generated by permuting γ percent of the words in caption c.
- Random Word ( RW ): To explore rare words, the authors replace between 2 and all of the words in the reference caption with random words from the vocabulary: T_RW(D; γ) = {(c', i) | (c, i) ∈ D, c' ∈ W_γ(c) \ {c}}, where W_γ(c) represents all sentences generated by randomly replacing γ percent of the words in caption c.
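The three transformations can be sketched as follows. This is a minimal illustration, not the authors' implementation. Several things are assumptions: γ is treated as a fraction in [0, 1], `neighbors` is a hypothetical precomputed map from an image id to its nearest-neighbor image ids (standing in for N_γ(i)), captions are split on whitespace, and each function returns a single sampled negative rather than the full set.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (caption, image_id)


def random_caption(dataset: Sequence[Pair],
                   neighbors: Dict[str, Sequence[str]],
                   rng: random.Random) -> List[Pair]:
    """T_RC: pair image i with a human caption taken from another image in N_gamma(i)."""
    captions_by_image: Dict[str, List[str]] = {}
    for caption, image in dataset:
        captions_by_image.setdefault(image, []).append(caption)
    negatives = []
    for _, image in dataset:
        candidates = [j for j in neighbors.get(image, ())
                      if j != image and j in captions_by_image]
        if candidates:
            other = rng.choice(candidates)
            negatives.append((rng.choice(captions_by_image[other]), image))
    return negatives


def _num_words_to_change(n_words: int, gamma: float) -> int:
    """Assumption: gamma is a fraction; at least 2 words are changed, at most all."""
    return min(n_words, max(2, round(gamma * n_words)))


def word_permutation(caption: str, gamma: float, rng: random.Random,
                     max_tries: int = 20) -> Optional[str]:
    """T_WP: permute a gamma-fraction of the words; returns None if c' == c every try."""
    words = caption.split()
    if len(words) < 2:
        return None
    k = _num_words_to_change(len(words), gamma)
    for _ in range(max_tries):
        positions = rng.sample(range(len(words)), k)
        targets = positions[:]
        rng.shuffle(targets)
        permuted = words[:]
        for src, dst in zip(positions, targets):
            permuted[dst] = words[src]
        if permuted != words:  # enforce c' != c
            return " ".join(permuted)
    return None


def random_word(caption: str, gamma: float, vocabulary: Sequence[str],
                rng: random.Random) -> Optional[str]:
    """T_RW: replace a gamma-fraction of the words with random vocabulary words."""
    words = caption.split()
    if len(words) < 2:
        return None
    k = _num_words_to_change(len(words), gamma)
    replaced = words[:]
    for pos in rng.sample(range(len(words)), k):
        choices = [w for w in vocabulary if w != words[pos]]
        if choices:
            replaced[pos] = rng.choice(choices)
    return " ".join(replaced) if replaced != words else None


rng = random.Random(0)
dataset = [("a dog runs on the beach", "img_1"),
           ("two planes parked at an airport", "img_2")]
neighbors = {"img_1": ["img_2"], "img_2": ["img_1"]}
vocabulary = ["cat", "tree", "blue", "jumps", "pizza"]

rc = random_caption(dataset, neighbors, rng)
wp = word_permutation(dataset[0][0], 0.5, rng)
rw = random_word(dataset[0][0], 0.5, vocabulary, rng)
print(rc, wp, rw, sep="\n")
```

Each output pair keeps the original image, so it can be labeled as a negative example alongside machine-generated captions when training the critique.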

The authors propose a systematic approach to measure the robustness of an evaluation metric to a given pathological transformation.

General Points:

Commonly used evaluation metrics for Image Captioning: BLEU, METEOR, ROUGE, CIDEr, SPICE. These metrics face two challenges. Firstly, many of them fail to correlate well with human judgments. Metrics based on measuring word overlap between candidate and reference captions struggle to capture the semantic meaning of a sentence, and therefore often correlate poorly with human judgments. Secondly, each evaluation metric has its own well-known blind spots, and rule-based metrics are often too inflexible to respond to new pathological cases.
Compact Bilinear Pooling ( CBP ) has been shown in "Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding" to be very effective at combining heterogeneous information from image and text.
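The word-overlap blind spot is easy to reproduce with a toy metric. The sketch below uses clipped unigram precision (the building block of BLEU-1, without the brevity penalty or higher-order n-grams), so it is a simplification and not any of the full metrics listed above. The example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference (clipped counts)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand)
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / len(cand)


reference = "a dog chases a cat across the yard"
permuted = "a cat chases a dog across the yard"            # meaning reversed
paraphrase = "a puppy runs after a kitten in the garden"   # meaning preserved

print(unigram_precision(permuted, reference))    # same bag of words as the reference
print(unigram_precision(paraphrase, reference))  # only "a", "a", "the" overlap
```

The permuted caption swaps who chases whom yet matches every reference word, while the faithful paraphrase shares only three of its nine words with the reference. A pure overlap score therefore ranks the wrong caption above the right one.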
