Link of the Paper: https://arxiv.org/abs/1412.2306

Main Points:

  1. An Alignment Model: Convolutional Neural Networks over image regions ( An image -> RCNN -> Top 19 detected locations in addition to the whole image -> the representations based on the pixels Ib inside each bounding box -> a set of h-dimensional vectors {vi | i = 1 ... 20} ), Bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding ( CNN - Structured Objective - BiRNN ).
  2. A Multimodal Recurrent Neural Network architecture: On the image side, Convolutional Neural Networks ( CNNs ) have recently emerged as a powerful class of models for image classification and object detection. On the sentence side, our work takes advantage of pretrained word vectors to obtain low-dimensional representations of words. Finally, Recurrent Neural Networks have been previously used in language modeling, but we additionally condition these models on images.
  3. Authors use bidirectional recurrent neural network to compute word representations in the sentence, dispensing of the need to compute dependency trees and allowing unbounded interactions of words and their context in the sentence.

Other Key Points:

  1. The primary challenge towards generating descriptions of images is in the design of a model that is rich enough to simultaneously reason about contents of images and their representation in the domain of natural language. Additionally, the model should be free of assumptions about specific hard-coded templates, rules or categories and instead rely on learning from the training data. The second, practical challenge is that datasets of image captions are available in large quantities on the internet, but these descriptions multiplex mentions of several entities whose locations in the images are unknown.

Paper Reading - Deep Visual-Semantic Alignments for Generating Image Descriptions ( CVPR 2015 )的更多相关文章

  1. Paper Reading - Deep Captioning with Multimodal Recurrent Neural Networks ( m-RNN ) ( ICLR 2015 ) ★

    Link of the Paper: https://arxiv.org/pdf/1412.6632.pdf Main Points: The authors propose a multimodal ...

  2. Paper Reading - Show and Tell: A Neural Image Caption Generator ( CVPR 2015 )

    Link of the Paper: https://arxiv.org/abs/1411.4555 Main Points: A generative model ( NIC, GoogLeNet ...

  3. Deep Visual-Semantic Alignments for Generating Image Descriptions(深度视觉-语义对应对于生成图像描述)

    https://cs.stanford.edu/people/karpathy/deepimagesent/ Abstract We present a model that generates na ...

  4. Paper Reading:Deep Neural Networks for YouTube Recommendations

    论文:Deep Neural Networks for YouTube Recommendations 发表时间:2016 发表作者:(Google)Paul Covington, Jay Adams ...

  5. Paper Reading:Deep Neural Networks for Object Detection

    发表时间:2013 发表作者:(Google)Szegedy C, Toshev A, Erhan D 发表刊物/会议:Advances in Neural Information Processin ...

  6. 论文笔记:Visual Semantic Navigation Using Scene Priors

    Visual Semantic Navigation Using Scene Priors 2018-10-21 19:39:26 Paper:  https://arxiv.org/pdf/1810 ...

  7. Paper Reading: Stereo DSO

    开篇第一篇就写一个paper reading吧,用markdown+vim写东西切换中英文挺麻烦的,有些就偷懒都用英文写了. Stereo DSO: Large-Scale Direct Sparse ...

  8. 论文笔记:Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association

    Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language ...

  9. 论文:利用深度强化学习模型定位新物体(VISUAL SEMANTIC NAVIGATION USING SCENE PRIORS)

    这是一篇被ICLR 2019 接收的论文.论文讨论了如何利用场景先验知识 (scene priors)来定位一个新场景(novel scene)中未曾见过的物体(unseen objects).举例来 ...

随机推荐

  1. C# WinForm开发系列 - ListBox/ListView/Panel【zz】

    原文传送:http://www.cnblogs.com/peterzb/archive/2009/06/18/1505424.html 1.ColorListBox   ColorListBox.zi ...

  2. react 配置开发环境

    一:先自行下载安装node和npm 二:cnpm install create-react-app -g 三:create-react-app my-project 四:cd my-project  ...

  3. DateFormat多线程使用问题

    reference DateFormat in a Multithreading Environment

  4. mybatis的Mapper.xml文件SQL语句BadSqlGrammarException之FUNCTION错误系列

    想必各位在开发过程中一定使用过:统计的功能,用到了很多SQL的函数,于是就直接写在Mapper文件中了: 比如: member_num,MAX(ID) AS newestLoanID,MIN (ID) ...

  5. redefinition of class解决

    垃圾玩意我在这儿翻车了. 编译器:Code::Block(懒得用VS,而且又太大了,CB小,而且也就一个控制台程序) Note to myself: 写完一个class的文件定义,编译,通过之后: 1 ...

  6. #leetcode刷题之路16-最接近的三数之和

    给定一个包括 n 个整数的数组 nums 和 一个目标值 target.找出 nums 中的三个整数,使得它们的和与 target 最接近.返回这三个数的和.假定每组输入只存在唯一答案. 例如,给定数 ...

  7. python3 datetime和time获取当前日期和时间

    import datetime import time # 获取当前时间, 其中中包含了year, month, hour, 需要import datetime today = datetime.da ...

  8. windows系统,MongoDB开启用户验证登录的正确姿势

    MongoDB默认安装并没有开启用户名密码登录,这样太不安全了,百度出来的开启验证登录的文章,对初次使用MongoDB的小白太不友好了,总结下经验,自己写一份指引. 1,我的安装路径是C:\Progr ...

  9. vue+element 页面输入框--回车导致页面刷新的问题

    el-form 后面加上 @submit.native.prevent

  10. 课时9.HTML发展史(了解)

    这个图片里的时间不用都记住,只需要记住一些特殊的,1993年,1995年(在W3C接手以后,才有了真正意义上的标准),1999年这几个时间 WHATWG的目的是推广HTML的标准,HTML5是浏览器厂 ...