(Repost) Attention in Long Short-Term Memory Recurrent Neural Networks
The Encoder-Decoder architecture is popular because it has demonstrated state-of-the-art results across a range of domains.
A limitation of the architecture is that it encodes the input sequence to a fixed-length internal representation. This limits the length of input sequences that can reasonably be learned and results in worse performance for very long input sequences.
In this post, you will discover the attention mechanism for recurrent neural networks that seeks to overcome this limitation.
After reading this post, you will know:
- The limitation of the encoder-decoder architecture and the fixed-length internal representation.
- The attention mechanism that overcomes this limitation by allowing the network to learn where to pay attention in the input sequence for each item in the output sequence.
- 5 applications of the attention mechanism with recurrent neural networks in domains such as text translation, speech recognition, and more.
Let’s get started.
Attention in Long Short-Term Memory Recurrent Neural Networks
Photo by Jonas Schleske, some rights reserved.
Problem With Long Sequences
The encoder-decoder recurrent neural network is an architecture where one set of LSTMs learns to encode input sequences into a fixed-length internal representation, and a second set of LSTMs reads the internal representation and decodes it into an output sequence.
This architecture has shown state-of-the-art results on difficult sequence prediction problems like text translation and quickly became the dominant approach.
For example, see:
- Sequence to Sequence Learning with Neural Networks, 2014
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014
The encoder-decoder architecture still achieves excellent results on a wide range of problems. Nevertheless, it suffers from the constraint that every input sequence, whatever its length, must be encoded to a fixed-length internal vector.
This is believed to limit the performance of these networks, especially when considering long input sequences, such as very long sentences in text translation problems.
A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus.
— Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
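To make the bottleneck concrete, here is a minimal sketch that is not part of the original post. It runs a toy encoder over a short and a long input and shows that both end up as a vector of the same size. The function name `encode`, the hidden size of 4, and the random, untrained weights are all assumptions for illustration, and a plain tanh RNN cell stands in for the LSTM to keep the code short.

```python
import math
import random


def encode(sequence, hidden_size=4, seed=0):
    """Run a toy tanh RNN over a sequence of floats and return the final state.

    Assumption: a plain RNN cell replaces the LSTM, and the weights are
    random and untrained. Only the shape of the result matters here.
    """
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    state = [0.0] * hidden_size
    for x in sequence:
        # Each new state mixes the current input with the previous state.
        state = [
            math.tanh(w_in[i] * x
                      + sum(w_rec[i][j] * state[j] for j in range(hidden_size)))
            for i in range(hidden_size)
        ]
    return state


short_code = encode([0.1, 0.2, 0.3])                 # 3 time steps
long_code = encode([0.01 * t for t in range(500)])   # 500 time steps
print(len(short_code), len(long_code))  # 4 4
```

Whether the input has 3 steps or 500, the decoder in this setup would only ever see those 4 numbers, which is the constraint the quote above describes.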
Attention within Sequences
Attention is the idea of freeing the encoder-decoder architecture from the fixed-length internal representation.
This is achieved by keeping the intermediate outputs from the encoder LSTM at each step of the input sequence and training the model to pay selective attention to these intermediate outputs and relate them to items in the output sequence.
Put another way, each item in the output sequence is conditional on selected items in the input sequence.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.
… it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector.
— Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
This increases the computational burden of the model, but results in a more targeted and better-performing model.
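As a rough illustration of a single decoding step, the sketch below scores every encoder output against the current decoder state, normalizes the scores with a softmax, and builds a context vector as the weighted sum. It uses the additive scoring form from the Bahdanau et al. paper, score_j = v · tanh(W_s s + W_h h_j). The names `additive_attention`, `W_s`, `W_h`, and `v`, the dimensions, and the random untrained weights are assumptions for illustration only.

```python
import math
import random


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, encoder_outputs, W_s, W_h, v):
    """Return (context, weights) for one decoder step.

    score_j = v . tanh(W_s @ decoder_state + W_h @ encoder_outputs[j])
    """
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_outputs:
        projected_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of encoder outputs.
    dim = len(encoder_outputs[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_outputs))
               for k in range(dim)]
    return context, weights


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 5 source positions, untrained random parameters.
enc_dim, dec_dim, attn_dim, src_len = 3, 2, 4, 5
encoder_outputs = rand_matrix(src_len, enc_dim)
decoder_state = [rng.uniform(-1, 1) for _ in range(dec_dim)]
W_s = rand_matrix(attn_dim, dec_dim)
W_h = rand_matrix(attn_dim, enc_dim)
v = [rng.uniform(-1, 1) for _ in range(attn_dim)]

context, weights = additive_attention(decoder_state, encoder_outputs, W_s, W_h, v)
print(len(weights), len(context))  # 5 3
```

The scoring loop runs over every encoder output at every decoder step, which is where the extra computation comes from. In a trained model the parameters would be learned jointly with the encoder and decoder.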
In addition, the model is also able to show how attention is paid to the input sequence when predicting the output sequence. This can help in understanding and diagnosing exactly what the model is considering and to what degree for specific input-output pairs.
The proposed approach provides an intuitive way to inspect the (soft-)alignment between the words in a generated translation and those in a source sentence. This is done by visualizing the annotation weights… Each row of a matrix in each plot indicates the weights associated with the annotations. From this we see which positions in the source sentence were considered more important when generating the target word.
— Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
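The kind of inspection described in the quote can be sketched in a few lines. The example below takes a matrix of attention weights, with one row per generated word and one column per source word, and reports which source word received the most weight for each output word. The weights are invented by hand for illustration and do not come from any trained model or from the paper; the helper names `strongest_alignments` and `render_alignment` are likewise hypothetical.

```python
def strongest_alignments(source_words, target_words, weights):
    """For each target word, return the source word with the largest weight."""
    pairs = []
    for target, row in zip(target_words, weights):
        best = max(range(len(row)), key=lambda j: row[j])
        pairs.append((target, source_words[best]))
    return pairs


def render_alignment(source_words, target_words, weights, shades=" .:*#"):
    """Render the weight matrix as text, one row per target word."""
    width = max(len(w) for w in source_words) + 1
    label_width = max(len(w) for w in target_words)
    lines = [" " * label_width + "".join(w.rjust(width) for w in source_words)]
    for target, row in zip(target_words, weights):
        cells = ""
        for weight in row:
            # Map a weight in [0, 1] to one of the shade characters.
            shade = shades[min(int(weight * len(shades)), len(shades) - 1)]
            cells += shade.rjust(width)
        lines.append(target.ljust(label_width) + cells)
    return "\n".join(lines)


# Made-up weights (each row sums to 1); NOT the output of a real model.
source_words = ["le", "chat", "noir"]
target_words = ["the", "black", "cat"]
weights = [
    [0.90, 0.05, 0.05],  # "the"   attends mostly to "le"
    [0.05, 0.15, 0.80],  # "black" attends mostly to "noir"
    [0.05, 0.85, 0.10],  # "cat"   attends mostly to "chat"
]

pairs = strongest_alignments(source_words, target_words, weights)
print(pairs)  # [('the', 'le'), ('black', 'noir'), ('cat', 'chat')]
print(render_alignment(source_words, target_words, weights))
```

With real weights, an off-diagonal pattern like the one for "black" and "cat" is how a word-order difference between the two languages would show up in the plot.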
Problem with Large Images
Convolutional neural networks applied to computer vision problems also suffer from similar limitations, where it can be difficult to learn models on very large images.
As a result, a series of glimpses can be taken of a large image to formulate an approximate impression of the image before making a prediction.
One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making.
— Recurrent Models of Visual Attention, 2014
These glimpse-based modifications may also be considered attention, but are not considered in this post.
For details, see the papers:
- Recurrent Models of Visual Attention, 2014
- DRAW: A Recurrent Neural Network For Image Generation, 2014
- Multiple Object Recognition with Visual Attention, 2014
5 Examples of Attention in Sequence Prediction
This section provides some specific examples of how attention is used for sequence prediction with recurrent neural networks.
1. Attention in Text Translation
The motivating example mentioned above is text translation.
Given an input sequence of a sentence in French, translate and output a sentence in English. Attention is used to focus on specific words in the input sequence for each word in the output sequence.
We extended the basic encoder–decoder by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word. This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word.
— Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
Attentional Interpretation of French to English Translation
Taken from Dzmitry Bahdanau, et al., Neural machine translation by jointly learning to align and translate, 2015
2. Attention in Image Descriptions
Unlike the glimpse approach, the sequence-based attention mechanism can be applied to computer vision problems to help the model work out which parts of the image, as represented by a convolutional neural network, to attend to when outputting a sequence, such as a caption.
Given an input of an image, output an English description of the image. Attention is used to focus on different parts of the image for each word in the output sequence.
We propose an attention based approach that gives state of the art performance on three benchmark datasets … We also show how the learned attention can be exploited to give more interpretability into the models generation process, and demonstrate that the learned alignments correspond very well to human intuition.
— Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016
Attentional Interpretation of Output Words to Specific Regions on the Input Images
Taken from Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2016
3. Attention in Entailment
Given a premise scenario and a hypothesis about the scenario in English, output whether the premise contradicts, is unrelated to, or entails the hypothesis.
For example:
- premise: “A wedding party taking pictures”
- hypothesis: “Someone got married”
Attention is used to relate each word in the hypothesis to words in the premise, and vice versa.
We present a neural model based on LSTMs that reads two sentences in one go to determine entailment, as opposed to mapping each sentence independently into a semantic space. We extend this model with a neural word-by-word attention mechanism to encourage reasoning over entailments of pairs of words and phrases. … An extension with word-by-word neural attention surpasses this strong benchmark LSTM result by 2.6 percentage points, setting a new state-of-the-art accuracy…
— Reasoning about Entailment with Neural Attention, 2016
Attentional Interpretation of Premise Words to Hypothesis Words
Taken from Reasoning about Entailment with Neural Attention, 2016
4. Attention in Speech Recognition
Given an input sequence of English speech snippets, output a sequence of phonemes.
Attention is used to relate each phoneme in the output sequence to specific frames of audio in the input sequence.
… a novel end-to-end trainable speech recognition architecture based on a hybrid attention mechanism which combines both content and location information in order to select the next position in the input sequence for decoding. One desirable property of the proposed model is that it can recognize utterances much longer than the ones it was trained on.
— Attention-Based Models for Speech Recognition, 2015
Attentional Interpretation of Output Phoneme Location to Input Frames of Audio
Taken from Attention-Based Models for Speech Recognition, 2015
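The idea of combining content and location information can be illustrated with a heavily simplified sketch. Here the content scores say how well each audio frame matches what the decoder is looking for, and a location term, computed by sliding a small filter over the previous step's attention weights, nudges the model toward frames just after the one it attended to last. This is not the paper's exact formulation, which learns its filters and scoring parameters; the filter, the `location_gain` value, the scores, and the function names below are fixed by hand and are assumptions for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def slide_filter(signal, kernel):
    """Slide a 1-D filter over the signal with zero padding (same length out).

    out[j] = sum_k kernel[k] * signal[j + k - len(kernel) // 2]
    """
    half = len(kernel) // 2
    out = []
    for j in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            idx = j + k - half
            if 0 <= idx < len(signal):
                total += weight * signal[idx]
        out.append(total)
    return out


def hybrid_attention(content_scores, previous_weights, kernel, location_gain):
    """Combine content scores with location features from the last alignment."""
    location_features = slide_filter(previous_weights, kernel)
    scores = [c + location_gain * f
              for c, f in zip(content_scores, location_features)]
    return softmax(scores)


# Hypothetical scores: frames 1 and 5 sound identical to the decoder.
content_scores = [0.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0]
# At the previous step the model attended entirely to frame 0.
previous_weights = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# This hand-picked filter shifts the previous alignment one frame forward.
kernel = [1.0, 0.0, 0.0]

content_only = hybrid_attention(content_scores, previous_weights, kernel, 0.0)
hybrid = hybrid_attention(content_scores, previous_weights, kernel, 3.0)
print(content_only[1] == content_only[5])  # True: content alone cannot choose
print(hybrid[1] > hybrid[5])               # True: location breaks the tie
```

With content alone, the two identical frames receive equal weight. Adding the location term favors the frame that follows the previously attended position, which is the sort of behavior that helps when the same sound occurs more than once in an utterance.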
5. Attention in Text Summarization
Given an input sequence of an English article, output a sequence of English words that summarize the input.
Attention is used to relate each word in the output summary to specific words in the input document.
… a neural attention-based model for abstractive summarization, based on recent developments in neural machine translation. We combine this probabilistic model with a generation algorithm which produces accurate abstractive summaries.
— A Neural Attention Model for Abstractive Sentence Summarization, 2015
Attentional Interpretation of Words in the Input Document to the Output Summary
Taken from A Neural Attention Model for Abstractive Sentence Summarization, 2015
Further Reading
This section provides additional resources if you would like to learn more about adding attention to LSTMs.
- Attention and memory in deep learning and NLP
- Attention Mechanism
- Survey on Attention-based Models Applied in NLP
- What is exactly the attention mechanism introduced to RNN? on Quora.
- What is Attention Mechanism in Neural Networks?
Keras does not offer attention out of the box at the time of writing, but there are a few third-party implementations. See:
- Deep Language Modeling for Question Answering using Keras
- Attention Model Available!
- Keras Attention Mechanism
- Attention and Augmented Recurrent Neural Networks
- How to add Attention on top of a Recurrent Layer (Text Classification)
- Attention Mechanism Implementation Issue
- Implementing simple neural attention model (for padded inputs)
- Attention layer requires another PR
- seq2seq library
Do you know of some good resources on attention in recurrent neural networks?
Let me know in the comments.
Summary
In this post, you discovered the attention mechanism for sequence prediction problems with LSTM recurrent neural networks.
Specifically, you learned:
- That the encoder-decoder architecture for recurrent neural networks uses a fixed-length internal representation that imposes a constraint that limits learning very long sequences.
- That attention overcomes the limitation in the encoder-decoder architecture by allowing the network to learn where to pay attention to the input for each item in the output sequence.
- That the approach has been used across different types of sequence prediction problems, including text translation, speech recognition, and more.
Do you have any questions about attention in recurrent neural networks?
Ask your questions in the comments below and I will do my best to answer.