Paper Reading - Deep Captioning with Multimodal Recurrent Neural Networks ( m-RNN ) ( ICLR 2015 ) ★

Link of the Paper: https://arxiv.org/pdf/1412.6632.pdf

Main Points:

The authors propose a multimodal Recurrent Neural Network ( m-RNN ) model ( AlexNet/VGGNet + a multimodal layer + RNNs ). Their work differs from related methods in two major ways. Firstly, they incorporate a two-layer word embedding system in the m-RNN network structure, which learns the word representation more efficiently than a single-layer word embedding. Secondly, they do not use the recurrent layer to store the visual information. Instead, the image representation is fed into the m-RNN model along with every word in the sentence description.
Most of the sentence-image multimodal models use pre-computed word embedding vectors as the initialization of their models. In contrast, the authors randomly initialize their word embedding layers and learn them from the training data.
The m-RNN model is trained using a log-likelihood cost function. The errors can be backpropagated to the three parts ( the vision part, the language part, and the multimodal part ) of the m-RNN model to update the model parameters simultaneously.
The hyperparameters, such as layer dimensions and the choice of the non-linear activation functions, are tuned via cross-validation on the Flickr8K dataset and are then fixed across all the experiments.
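
The structure described in the points above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the weight names ( E1, E2, U_r, V_w, V_r, V_I, W_out ), the ReLU/tanh activations and the random "image feature" are all assumptions made for illustration. It only shows the three ideas from the notes: a two-layer word embedding learned from random initialization, an image feature that enters the multimodal layer at every time step instead of being stored in the recurrent state, and a sentence log-likelihood as the training signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
VOCAB, EMB1, EMB2, REC, MULTI, IMG = 50, 16, 32, 32, 64, 128

def init(*shape):
    """Small random weights: the embeddings are NOT pre-trained."""
    return rng.normal(scale=0.1, size=shape)

params = {
    "E1": init(VOCAB, EMB1),     # word embedding layer I (lookup table)
    "E2": init(EMB1, EMB2),      # word embedding layer II (dense)
    "U_r": init(REC, REC),       # recurrent-to-recurrent weights
    "V_w": init(EMB2, MULTI),    # word -> multimodal layer
    "V_r": init(REC, MULTI),     # recurrent state -> multimodal layer
    "V_I": init(IMG, MULTI),     # image feature -> multimodal layer
    "W_out": init(MULTI, VOCAB), # multimodal layer -> vocabulary logits
}

def step(word_id, r_prev, image_feat, p):
    """One time step: returns next-word probabilities and the new state."""
    # Two-layer word embedding (ReLU between layers is an assumption).
    w = np.maximum(0.0, p["E1"][word_id] @ p["E2"])
    # The recurrent layer sees only words, not the image (needs EMB2 == REC).
    r = np.maximum(0.0, r_prev @ p["U_r"] + w)
    # The multimodal layer fuses word, recurrent state and image every step.
    m = np.tanh(w @ p["V_w"] + r @ p["V_r"] + image_feat @ p["V_I"])
    logits = m @ p["W_out"]
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, r

def sentence_log_likelihood(word_ids, image_feat, p):
    """Sum of log P(next word | previous words, image) over the sentence."""
    r = np.zeros(REC)
    total = 0.0
    for cur, nxt in zip(word_ids[:-1], word_ids[1:]):
        probs, r = step(cur, r, image_feat, p)
        total += np.log(probs[nxt])
    return total

image_feat = rng.normal(size=IMG)   # stand-in for an AlexNet/VGGNet feature
sentence = [0, 7, 12, 3, 1]         # made-up word ids, e.g. <start> ... <end>
ll = sentence_log_likelihood(sentence, image_feat, params)
print("log-likelihood:", ll)
```

Training would maximize this log-likelihood ( or minimize its negative ) with respect to all weights at once, which is what backpropagating through the vision, language and multimodal parts amounts to. The gradient computation is left out of the sketch.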

Other Key Points:

Applications for Image Captioning: early childhood education, image retrieval, and navigation for the blind.
There are generally three categories of methods for generating novel sentence descriptions for images. The first category assumes a specific rule of the language grammar. These methods parse the sentence and divide it into several parts, and they generate sentences that are syntactically correct. The second category retrieves similar captioned images, and generates new descriptions by generalizing and re-composing the retrieved captions. The third category, which is most closely related to the authors' method, learns a probability density over the space of multimodal inputs, using, for example, Deep Boltzmann Machines and topic models. These methods generate sentences with richer and more flexible structure than the first group. The probability of generating sentences using the model can serve as the affinity metric for retrieval.
Many previous methods treat the task of describing images as a retrieval task and formulate the problem as a ranking or embedding learning problem. They first extract the word and sentence features ( e.g. Socher et al.(2014) use a dependency tree Recursive Neural Network to extract sentence features ) as well as the image features. Then they optimize a ranking cost to learn an embedding model that maps both the sentence feature and the image feature to a common semantic feature space. In this way, they can directly calculate the distance between images and sentences. These methods generate image captions by retrieving them from a sentence database. Thus, they lack the ability to generate novel sentences or to describe images that contain novel combinations of objects and scenes.
Benchmark datasets for Image Captioning: IAPR TC-12 ( Grubinger et al.(2006) ), Flickr8K ( Rashtchian et al.(2010) ), Flickr30K ( Young et al.(2014) ) and MS COCO ( Lin et al.(2014) ).
Evaluation Metrics for Sentence Generation: Sentence perplexity and BLEU scores.
Tasks related to Image Captioning: Generating Novel Sentences, Retrieving Images Given a Sentence, Retrieving Sentences Given an Image.
The m-RNN model is trained using Baidu's internal deep learning platform PADDLE.
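
Two of the points above, sentence perplexity as an evaluation metric and the sentence probability as an affinity metric for retrieval, can be made concrete with a small sketch. It assumes that a trained model has already produced the per-word probabilities P( w_n | w_1..n-1, image ); the function names and the numbers are hypothetical. Ranking candidates directly by perplexity is a simplification, and the scoring actually used in the paper may include extra normalization.

```python
import math

def perplexity(word_probs):
    """Sentence perplexity from per-word conditional probabilities.

    Computes 2 ** ( -(1/L) * sum(log2 p_n) ), where p_n is the probability
    the model assigned to the n-th word given the previous words and the image.
    """
    if not word_probs:
        raise ValueError("word_probs must not be empty")
    log2_sum = sum(math.log2(p) for p in word_probs)
    return 2.0 ** (-log2_sum / len(word_probs))

def rank_by_perplexity(candidates):
    """Order candidate names from best (lowest perplexity) to worst.

    `candidates` maps a name (a sentence id or an image id) to the list of
    per-word probabilities of the sentence under that pairing.
    """
    return sorted(candidates, key=lambda name: perplexity(candidates[name]))

# Made-up probabilities for two candidate sentences given the same image.
candidates = {
    "sentence_a": [0.5, 0.5],
    "sentence_b": [0.25, 0.25],
}
print(perplexity(candidates["sentence_b"]))
print(rank_by_perplexity(candidates))
```

Lower perplexity means the model finds the sentence more likely given the image, so the same score can evaluate generated text and rank retrieval candidates.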
