Paper Reading - Deep Captioning with Multimodal Recurrent Neural Networks ( m-RNN ) ( ICLR 2015 ) ★
Link of the Paper: https://arxiv.org/pdf/1412.6632.pdf
Main Points:
- The authors propose a multimodal Recurrent Neural Networks ( AlexNet/VGGNet + a multimodal layer + RNNs ). Their work has two major differences from these methods. Firstly, they incorporate a two-layer word embedding system in the m-RNN network structure which learns the word representation more efficiently than the single-layer word embedding. Secondly, they do not use the recurrent layer to store the visual information. The image representation is inputted to the m-RNN model along with every word in the sentence description.
- Most of the sentence-image multimodal models use pre-computed word embedding vectors as the initialization of their models. In contrast, the authors randomly initialize their word embedding layers and learn them from the training data.
- The m-RNN model is trained using a log-likelihood cost function. The errors can be backpropagated to the three parts ( the vision part, the language part, and the ) of the m-RNN model to update the model parameters simultaneously.
- The hyperparameters, such as layer dimensions and the choice of the non-linear activation functions, are tuned via cross-validation on Flickr8K dataset and are then fixed across all the experiments.
Other Key Points:
- Applications for Image Captioning: early childhood education, image retrieval, and navigation for the blind.
- There are generally three categories of methods for generating novel sentence descriptions for images. The first category assumes a specific rule of the language grammer. They parse the sentence and divide it into several parts. This kind of method generates sentences that are syntactically correct. The second category retrieves similar captioned images, and generates new descriptions by generalizing and re-composing the retrieved captions. The third category of methods, which is more related to our method, learns a probability density over the space of multimodal inputs, using for example, Deep Boltzmann Machines, and topic models. They generate sentences with richer and more flexible structure than the first group. The probability of generating sentences using the model can serve as the affinity metric for retrieval.
- Many previous methods treat the task of describing images as a retrieval task and formulate the problem as a ranking or embedding learning problem. They first extract the word and sentence features ( e.g. Socher et al.(2014) uses dependency tree Recursive Neural Network to extract sentence features ) as well as the image features. Then they optimize a ranking cost to learn an embedding model that maps both the sentence feature and the image feature to a common semantic feature space ( the same semantic space ). In this way, they can directly calculate the distance between images and sentences. These methods genarate image captions by retrieving them from a sentence database. Thus, they lack the ability of generating novel sentences or describing images that contain novel combinations of objects and scenes.
- Benchmark datasets for Image Captioning: IAPR TC-12 ( Grubinger et al.(2006) ), Flickr8K ( Rashtchian et al.(2010) ), Flickr30K ( Young et al.(2014) ) and MS COCO ( Lin et al.(2014) ).
- Evaluation Metrics for Sentence Generation: Sentence perplexity and BLUE scores.
- Tasks related to Image Captioning: Generating Novel Sentences, Retrieving Images Given a Sentence, Retrieving Sentences Given an Image.
- The m-RNN model is trained using Baidu's internal deep learning platform PADDLE.
Paper Reading - Deep Captioning with Multimodal Recurrent Neural Networks ( m-RNN ) ( ICLR 2015 ) ★的更多相关文章
- Paper Reading - Sequence to Sequence Learning with Neural Networks ( NIPS 2014 )
Link of the Paper: https://arxiv.org/pdf/1409.3215.pdf Main Points: Encoder-Decoder Model: Input seq ...
- 递归神经网络(Recurrent Neural Networks,RNN)
在深度学习领域,传统的多层感知机(MLP)具有出色的表现,取得了许多成功,它曾在许多不同的任务上——包括手写数字识别和目标分类上创造了记录.甚至到了今天,MLP在解决分类任务上始终都比其他方法要略胜一 ...
- Paper Reading - Deep Visual-Semantic Alignments for Generating Image Descriptions ( CVPR 2015 )
Link of the Paper: https://arxiv.org/abs/1412.2306 Main Points: An Alignment Model: Convolutional Ne ...
- Paper Reading - Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation ( CVPR 2015 )
Link of the Paper: https://ieeexplore.ieee.org/document/7298856/ A Correlative Paper: Learning a Rec ...
- Hyperspectral Image Classification Using Similarity Measurements-Based Deep Recurrent Neural Networks
用RNN来做像素分类,输入是一系列相近的像素,长度人为指定为l,相近是利用像素相似度或是范围相似度得到的,计算个欧氏距离或是SAM. 数据是两个高光谱数据 1.Pavia University,Ref ...
- Attention and Augmented Recurrent Neural Networks
Attention and Augmented Recurrent Neural Networks CHRIS OLAHGoogle Brain SHAN CARTERGoogle Brain Sep ...
- The Unreasonable Effectiveness of Recurrent Neural Networks (RNN)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/ There’s something magical about Recurrent Ne ...
- 课程五(Sequence Models),第一 周(Recurrent Neural Networks) —— 1.Programming assignments:Building a recurrent neural network - step by step
Building your Recurrent Neural Network - Step by Step Welcome to Course 5's first assignment! In thi ...
- (zhuan) Attention in Long Short-Term Memory Recurrent Neural Networks
Attention in Long Short-Term Memory Recurrent Neural Networks by Jason Brownlee on June 30, 2017 in ...
随机推荐
- 用画布canvas画安卓logo
<!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8&quo ...
- #leetcode刷题之路16-最接近的三数之和
给定一个包括 n 个整数的数组 nums 和 一个目标值 target.找出 nums 中的三个整数,使得它们的和与 target 最接近.返回这三个数的和.假定每组输入只存在唯一答案. 例如,给定数 ...
- 20181101noip模拟赛T1
思路: 我们看到这道题,可以一眼想到一维差分 但这样的复杂度是O(nq)的,显然会T 那么怎么优化呢? 我们会发现,差分的时候,在r~r+l-1的范围内 差分增加的值横坐标相同,纵坐标递增 减小的值横 ...
- 【Linux】文件、目录权限及归属
访问权限: 可读(read):允许查看文件内容.显示目录列表 可写(write):允许修改文件内容,允许在目录中新建.移动.删除文件或子目录 可执行(execute):允许运行程序.切换目录 归属: ...
- ajax与jsonp定义及使用方法
ajax 定义 ajax技术的目的是让javascript发送http请求,与后台通信,获取数据和信息. ajax通信的过程不会影响后续javascript的执行,从而实现异步. 同步和异步 现实生活 ...
- Homebrew(brew)安装MySQL成功后无法登录
Homebrew简称brew,OSX上的软件包管理工具,在Mac终端可以通过brew安装.更新.卸载各种软件,(简直就是神器级武器). 废话不多说,没安装brew自己去百度学习安装,这里就不多说了. ...
- NFS服务的搭建
NFS服务的作用:提供网络文件系统给客户机 nfs服务器的安装配置和使用: 1.将已经制作好的文件系统rootfs_fs210_audio.tgz 拷贝到 /opt,并解压(这里的/opt目录是通过s ...
- 【Mac】解决「无法将 chromedriver 移动到 /usr/bin 目录下」问题
问题描述 在搭建 Selenium 库 + ChromeDriver 爬虫环境时,遇到了无法将 chromedriver 移动到 /usr/bin 目录下的问题,如下图: 一查原来是因为系统有一个 S ...
- 111. Climbing Stairs 【LintCode easy】
Description You are climbing a stair case. It takes n steps to reach to the top. Each time you can e ...
- Linux 下 终端 相关的命令
1. 概述 Linux 服务器, 通常可以由多个终端连接 简单介绍一些 终端 相关的操作 最终的目的, 是定位到某个终端, 然后把它 踢下来, 甚至可以不让他再次连接 2. 环境 操作系统 CentO ...