Deep Visual-Semantic Alignments for Generating Image Descriptions

https://cs.stanford.edu/people/karpathy/deepimagesent/

Abstract

We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

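In plain terms: the model generates natural-language descriptions for whole images and for regions within them. It learns how language and visual data correspond from datasets of images paired with sentence descriptions. The alignment model combines three parts: a convolutional neural network over image regions, a bidirectional recurrent neural network over sentences, and a structured objective that aligns the two modalities in a shared multimodal embedding. A multimodal recurrent neural network then uses the inferred alignments to learn to generate new descriptions of image regions. The alignment model gives state-of-the-art retrieval results on Flickr8K, Flickr30K and MSCOCO, and the generated descriptions significantly outperform retrieval baselines on both full images and a new dataset of region-level annotations.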
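
The abstract mentions a structured objective that aligns regions and words in a shared embedding, but this excerpt does not spell it out. The sketch below is one plausible reading, not the authors' implementation: it assumes region vectors and word vectors have already been embedded in the same space, scores an image-sentence pair by letting every word pick its best-matching region, and trains with a max-margin ranking loss. The function names, the margin of 1.0, and the toy vectors are all illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def image_sentence_score(regions: List[Vector], words: List[Vector]) -> float:
    """Assumed score: each word aligns to its best region; sum over words."""
    return sum(max(dot(v, s) for v in regions) for s in words)


def ranking_loss(images: List[List[Vector]],
                 sentences: List[List[Vector]],
                 margin: float = 1.0) -> float:
    """Max-margin loss where pair (k, k) is the true image-sentence match."""
    n = len(images)
    scores = [[image_sentence_score(images[k], sentences[l]) for l in range(n)]
              for k in range(n)]
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue  # skip the matching pair itself
            # Image k should prefer its own sentence over sentence l.
            loss += max(0.0, scores[k][l] - scores[k][k] + margin)
            # Sentence k should prefer its own image over image l.
            loss += max(0.0, scores[l][k] - scores[k][k] + margin)
    return loss


# Toy data (hypothetical): two images with two region vectors each,
# and two sentences with already-embedded word vectors.
images = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
sentences = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0]],
]

aligned_loss = ranking_loss(images, sentences)
swapped_loss = ranking_loss(images, [sentences[1], sentences[0]])
```

With the toy data, the correctly paired batch incurs no loss while the batch with swapped sentences is penalized, which is the behavior a ranking objective of this kind is meant to produce. In a real system the region vectors would come from a CNN and the word vectors from a bidirectional RNN, as the abstract describes.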

Code: linked elsewhere

1. Introduction

A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene [14]. However, this remarkable ability has proven to be an elusive task for our visual recognition models. The majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories and great progress has been achieved in these endeavors [45, 11]. However, while closed vocabularies of visual concepts constitute a convenient modeling assumption, they are vastly restrictive when compared to the enormous amount of rich descriptions that a human can compose.

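Put simply: a quick glance at a picture is enough for a person to point out and describe the scene in great detail, yet this remarkable ability has proven elusive for visual recognition models.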

Some pioneering approaches that address the challenge of generating image descriptions have been developed [29, 13]. However, these models often rely on hard-coded visual concepts and sentence templates, which imposes limits on their variety. Moreover, the focus of these works has been on reducing complex visual scenes into a single sentence, which we consider to be an unnecessary restriction.

In this work, we strive to take a step towards the goal of generating dense descriptions of images (Figure 1). The primary challenge towards this goal is in the design of a model that is rich enough to simultaneously reason about contents of images and their representation in the domain of natural language. Additionally, the model should be free of assumptions about specific hard-coded templates, rules or categories and instead rely on learning from the training data. The second, practical challenge is that datasets of image captions are available in large quantities on the internet [21, 58, 37], but these descriptions multiplex mentions of several entities whose locations in the images are unknown.
