Hierarchical Question-Image Co-Attention for Visual Question Answering

NIPS 2016

Paper: https://arxiv.org/pdf/1606.00061.pdf

Code: https://github.com/jiasenlu/HieCoAttenVQA

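Introduction:

　　This paper proposes a new notion of co-attention that reasons jointly over image and text features, so that the features of the two modalities can guide each other.

　　In addition, the authors weight the input question from several perspectives and build image-question co-attention maps at three levels of a hierarchy: word level, phrase level, and question level.

　　Finally, at the phrase level, the authors propose a novel convolution-pooling strategy that adaptively selects the phrase size.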

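Methods:

1. Notation:

　　A question with T words is written Q = {q₁, ... , q_T}, where q_t is the feature vector of the t-th word. q_t^w, q_t^p, and q_t^s denote the word embedding, phrase embedding, and question embedding at position t, respectively.

　　The image feature is written V = {v₁, ... , v_N}, where v_n is the feature vector at spatial location n.

　　The co-attention features of the image and the question at each level of the hierarchy are denoted \hat{v}^r and \hat{q}^r, where r ∈ {w, p, s}.

　　The weights of the different modules and layers are denoted W, with subscripts and superscripts as needed.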

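2. Question Hierarchy:

　　Given the 1-hot encoding of the question words Q, the words are first embedded into a vector space to obtain Q^w = {q_1^w, ... , q_T^w}. To compute the phrase features, a 1-D convolution is applied to the word embedding vectors. Specifically, at each word location, the inner product of the word vectors with filters of three window sizes is computed: unigram, bigram, and trigram. For the t-th word, the convolution output with window size s is:

　　\hat{q}_{s,t}^p = \tanh(W_c^s q_{t:t+s-1}^w),  s ∈ {1, 2, 3}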

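　　where W_c^s is the weight parameter. The word-level features Q^w are appropriately 0-padded before being fed into the bigram and trigram convolutions, so that the sequence keeps its length after convolution. Given the convolution results, max-pooling is then applied across the different n-grams at each word location to obtain the phrase-level features:

　　q_t^p = \max(\hat{q}_{1,t}^p, \hat{q}_{2,t}^p, \hat{q}_{3,t}^p),  t ∈ {1, 2, ... , T}

　　where \hat{q}_{s,t}^p is the convolution output at position t for window size s.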

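　　Unlike the pooling used in earlier work, this pooling method adaptively selects among the different gram features at each time step, while preserving the original sequence length and order. An LSTM then encodes the sequence q_t^p obtained after max-pooling. The corresponding question-level feature q_t^s is the LSTM hidden vector at time t.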
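　　The convolution-pooling step described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `phrase_features`, the `filters` dictionary, and the choice to zero-pad on the right (so that position t sees words t to t+s-1) are assumptions made for the example, and the LSTM that produces the question-level features is omitted.

```python
import numpy as np

def phrase_features(Qw, filters):
    """Phrase-level features from word embeddings.

    Qw      : (T, d) array, one word embedding per row.
    filters : dict mapping window size s (1, 2, 3) to a weight
              matrix of shape (d, s * d). Hypothetical layout.
    Returns : (T, d) array of phrase-level features.
    """
    T, d = Qw.shape
    outputs = []
    for s, W in sorted(filters.items()):
        # Assumption: pad s - 1 zero rows at the end so every
        # position t has a full window of s word vectors.
        padded = np.vstack([Qw, np.zeros((s - 1, d))])
        # Flatten each window of s word vectors into one (s * d,) row.
        windows = np.stack([padded[t:t + s].reshape(-1) for t in range(T)])
        # Inner product with the filter, followed by tanh: shape (T, d).
        outputs.append(np.tanh(windows @ W.T))
    # Element-wise max over the unigram, bigram, and trigram outputs.
    return np.max(np.stack(outputs), axis=0)

# Usage with random data: 5 words, 4-dimensional embeddings.
rng = np.random.default_rng(0)
Qw = rng.standard_normal((5, 4))
filters = {s: rng.standard_normal((4, s * 4)) for s in (1, 2, 3)}
Qp = phrase_features(Qw, filters)
print(Qp.shape)
```

　　The output has the same length T as the input, which is what allows the phrase-level sequence to be fed to an LSTM position by position.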

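3. Co-Attention:

　　The authors propose two co-attention mechanisms. The first, parallel co-attention, generates image and question attention simultaneously. The second, alternating co-attention, generates image and question attention sequentially. As shown in Figure 2 of the paper, these co-attention mechanisms are executed at all three levels of the question hierarchy.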

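　　【Parallel Co-Attention】 This mechanism attends to the image and the question simultaneously. It connects the two by computing the similarity between image features and question features at all pairs of image locations and question locations. Specifically, given an image feature map V ∈ R^{d×N} and the question representation Q ∈ R^{d×T}, the affinity matrix C ∈ R^{T×N} is computed as:

　　C = \tanh(Q^T W_b V)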

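　　where W_b ∈ R^{d×d} contains the weights. After computing the affinity matrix, one possible way of computing the image (or question) attention is to simply maximize out the affinity over the locations of the other modality, i.e.

　　a^v[n] = \max_i(C_{i,n}),  a^q[t] = \max_j(C_{t,j})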

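　　Instead of choosing the max activation, the authors find that the final results improve if the affinity matrix is treated as a feature and the model learns to predict the image and question attention maps from it:

　　H^v = \tanh(W_v V + (W_q Q) C),  H^q = \tanh(W_q Q + (W_v V) C^T)
　　a^v = softmax(w_hv^T H^v),  a^q = softmax(w_hq^T H^q)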

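　　where W_v, W_q ∈ R^{k×d} and w_hv, w_hq ∈ R^k are weight parameters, and a^v ∈ R^N and a^q ∈ R^T are the attention probabilities of each image region v_n and each word q_t, respectively. The affinity matrix C transforms the question attention space into the image attention space (and C^T does the reverse). Based on these attention weights, the image and question attention vectors are computed as weighted sums of the image features and the question features:

　　\hat{v} = \sum_{n=1}^{N} a_n^v v_n,  \hat{q} = \sum_{t=1}^{T} a_t^q q_t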
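　　A minimal NumPy sketch of parallel co-attention is given below, assuming column-major features (V of shape d×N, Q of shape d×T) and randomly initialised weights. The function names and argument order are chosen for this example only and are not taken from the released code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def parallel_co_attention(V, Q, W_b, W_v, W_q, w_hv, w_hq):
    """V: (d, N) image features, Q: (d, T) question features.
    W_b: (d, d); W_v, W_q: (k, d); w_hv, w_hq: (k,).
    Returns attended vectors and attention weights."""
    # Affinity between every question location and image location.
    C = np.tanh(Q.T @ W_b @ V)                  # (T, N)
    # Use the affinity as a feature to mix the two modalities.
    H_v = np.tanh(W_v @ V + (W_q @ Q) @ C)      # (k, N)
    H_q = np.tanh(W_q @ Q + (W_v @ V) @ C.T)    # (k, T)
    a_v = softmax(w_hv @ H_v)                   # (N,) image attention
    a_q = softmax(w_hq @ H_q)                   # (T,) question attention
    # Attended vectors are attention-weighted sums of the columns.
    v_hat = V @ a_v                             # (d,)
    q_hat = Q @ a_q                             # (d,)
    return v_hat, q_hat, a_v, a_q

# Usage with random data: d=6, k=5, N=7 regions, T=4 words.
rng = np.random.default_rng(0)
d, k, N, T = 6, 5, 7, 4
V, Q = rng.standard_normal((d, N)), rng.standard_normal((d, T))
W_b = rng.standard_normal((d, d))
W_v, W_q = rng.standard_normal((k, d)), rng.standard_normal((k, d))
w_hv, w_hq = rng.standard_normal(k), rng.standard_normal(k)
v_hat, q_hat, a_v, a_q = parallel_co_attention(V, Q, W_b, W_v, W_q, w_hv, w_hq)
print(a_v.sum(), a_q.sum())
```

　　Both attention vectors are probability distributions, one over the N image locations and one over the T question positions.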

　　【Alternating Co-Attention】 This mechanism alternates between generating image attention and question attention. Briefly, it consists of three steps:

　　1) summarize the question into a single vector q;

　　2) attend to the image based on the question summary q;

　　3) attend to the question based on the attended image feature.

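　　Define an attention operation \hat{x} = A(X; g), which takes the image (or question) features X and an attention guidance g derived from the question (or image) as inputs, and outputs the attended image (or question) vector. The operation can be expressed as:

　　H = \tanh(W_x X + (W_g g) \mathbb{1}^T)
　　a^x = softmax(w_hx^T H)
　　\hat{x} = \sum_i a_i^x x_i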

　　where \mathbb{1} (the blackboard-bold 1) is a vector whose elements are all 1.
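　　The three steps can be sketched with a single reusable attention function. In this illustration the guidance is a zero vector in the first step (the question is summarised without guidance), then the question summary guides attention over the image, and finally the attended image vector guides attention over the question. Using a separate parameter tuple for each step, and the names `attend` and `alternating_co_attention`, are assumptions of this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, g, W_x, W_g, w_hx):
    """Attention operation A(X; g).
    X: (d, M) features, g: (d,) guidance.
    W_x, W_g: (k, d); w_hx: (k,)."""
    # Adding (W_g g) as a column to every location is the
    # broadcast form of (W_g g) 1^T.
    H = np.tanh(W_x @ X + (W_g @ g)[:, None])   # (k, M)
    a = softmax(w_hx @ H)                       # (M,)
    return X @ a, a                             # attended vector, weights

def alternating_co_attention(V, Q, p1, p2, p3):
    """p1, p2, p3: (W_x, W_g, w_hx) tuples, one per step (assumed)."""
    d = Q.shape[0]
    # Step 1: summarise the question with zero guidance.
    s, _ = attend(Q, np.zeros(d), *p1)
    # Step 2: attend to the image, guided by the question summary.
    v_hat, a_v = attend(V, s, *p2)
    # Step 3: attend to the question, guided by the attended image.
    q_hat, a_q = attend(Q, v_hat, *p3)
    return v_hat, q_hat, a_v, a_q

# Usage with random data: d=6, k=5, N=7 regions, T=4 words.
rng = np.random.default_rng(0)
d, k, N, T = 6, 5, 7, 4
V, Q = rng.standard_normal((d, N)), rng.standard_normal((d, T))
make = lambda: (rng.standard_normal((k, d)), rng.standard_normal((k, d)),
                rng.standard_normal(k))
v_hat, q_hat, a_v, a_q = alternating_co_attention(V, Q, make(), make(), make())
print(v_hat.shape, q_hat.shape)
```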

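4. Encoding for Predicting Answers:

　　VQA is treated as a classification task: the answer is predicted from the attended image and question features of all three levels. A multi-layer perceptron (MLP) recursively encodes the attention features:

　　h^w = \tanh(W_w (\hat{q}^w + \hat{v}^w))
　　h^p = \tanh(W_p [(\hat{q}^p + \hat{v}^p), h^w])
　　h^s = \tanh(W_s [(\hat{q}^s + \hat{v}^s), h^p])
　　p = softmax(W_h h^s)

　　where W_w, W_p, W_s, and W_h are weight parameters, [·, ·] denotes the concatenation of two vectors, and p is the probability of the final answer.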
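　　The recursive encoding can be sketched as below: each level adds its attended question and image vectors, concatenates the result with the hidden vector of the previous level, and applies a tanh layer. The function name `encode_answer`, the hidden size, and the number of candidate answers are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def encode_answer(q_hats, v_hats, W_w, W_p, W_s, W_h):
    """q_hats, v_hats: tuples of three (d,) vectors for the word,
    phrase, and question levels.
    W_w: (h, d); W_p, W_s: (h, d + h); W_h: (num_answers, h)."""
    (q_w, q_p, q_s), (v_w, v_p, v_s) = q_hats, v_hats
    # Word level: sum of attended question and image features.
    h_w = np.tanh(W_w @ (q_w + v_w))
    # Phrase level: concatenate with the word-level hidden vector.
    h_p = np.tanh(W_p @ np.concatenate([q_p + v_p, h_w]))
    # Question level: concatenate with the phrase-level hidden vector.
    h_s = np.tanh(W_s @ np.concatenate([q_s + v_s, h_p]))
    # Probability distribution over the candidate answers.
    return softmax(W_h @ h_s)

# Usage with random data: d=6, hidden size 8, 10 candidate answers.
rng = np.random.default_rng(0)
d, h, num_answers = 6, 8, 10
q_hats = tuple(rng.standard_normal(d) for _ in range(3))
v_hats = tuple(rng.standard_normal(d) for _ in range(3))
W_w = rng.standard_normal((h, d))
W_p = rng.standard_normal((h, d + h))
W_s = rng.standard_normal((h, d + h))
W_h = rng.standard_normal((num_answers, h))
p = encode_answer(q_hats, v_hats, W_w, W_p, W_s, W_h)
print(p.shape, int(np.argmax(p)))
```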

Experiments:
