Link of the Paper: https://arxiv.org/abs/1706.03762

Motivation:

  • The inherently sequential nature of Recurrent Models precludes parallelization within training examples.
  • Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.

Innovation:

  • The first sequence transduction model, the Transformer, relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or Convolutions. The Transformer follows the overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

    • Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. The authors employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm (x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension dmodel = 512.
    • Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, they employ residual connections around each of the sub-layers, followed by layer normalization. They also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

  • Scaled Dot-Product Attention and Multi-Head Attention. The Transformer uses multi-head attention in three different ways:

    • In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [ Google’s neural machine translation system: Bridging the gap between human and machine translation. ].
    • The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder. [ More about Attention Definition ]
    • Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. They need to prevent leftward information flow in the decoder to preserve the auto-regressive property. They implement this inside of scaled dot-product attention by masking out (setting to -∞) all values in the input of the softmax which correspond to illegal connections. See Figure 2.

In terms of encoder-decoder, the query Q is usually the hidden state of the decoder. Whereas key K, is the hidden state of the encoder, and the corresponding value V is normalized weight, representing how much attention a key gets.    -- The Transformer - Attention is all you need.

    

  • Positional Encoding: Add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. In this work, they use sine and cosine functions of different frequencies:

    • PE(pos, 2i) = sin ( pos / 100002i/dmodel )
    • PE(pos, 2i+1) = cos ( pos / 100002i/dmodel)

Improvement:

  • Position-wise Feed-Forward Networks: In addition to attention sub-layers, each of the layers in their encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between. FFN(x) = max(0, xW1 + b1)W2 + b2.

General Points:

  • An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
  • Why Self-Attention:
    • One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
    • The third is the path length between long-range dependencies in the network.

Paper Reading - Attention Is All You Need ( NIPS 2017 ) ★的更多相关文章

  1. Paper Reading - Convolutional Sequence to Sequence Learning ( CoRR 2017 ) ★

    Link of the Paper: https://arxiv.org/abs/1705.03122 Motivation: Compared to recurrent layers, convol ...

  2. Paper Reading: Stereo DSO

    开篇第一篇就写一个paper reading吧,用markdown+vim写东西切换中英文挺麻烦的,有些就偷懒都用英文写了. Stereo DSO: Large-Scale Direct Sparse ...

  3. Paper Reading - Im2Text: Describing Images Using 1 Million Captioned Photographs ( NIPS 2011 )

    Link of the Paper: http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio ...

  4. Paper Reading - Show, Attend and Tell: Neural Image Caption Generation with Visual Attention ( ICML 2015 )

    Link of the Paper: https://arxiv.org/pdf/1502.03044.pdf Main Points: Encoder-Decoder Framework: Enco ...

  5. Paper Reading - Sequence to Sequence Learning with Neural Networks ( NIPS 2014 )

    Link of the Paper: https://arxiv.org/pdf/1409.3215.pdf Main Points: Encoder-Decoder Model: Input seq ...

  6. [Paper Reading] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

    论文链接:https://arxiv.org/pdf/1502.03044.pdf 代码链接:https://github.com/kelvinxu/arctic-captions & htt ...

  7. Paper Reading - CNN+CNN: Convolutional Decoders for Image Captioning

    Link of the Paper: https://arxiv.org/abs/1805.09019 Innovations: The authors propose a CNN + CNN fra ...

  8. Paper Reading - Learning to Evaluate Image Captioning ( CVPR 2018 ) ★

    Link of the Paper: https://arxiv.org/abs/1806.06422 Innovations: The authors propose a novel learnin ...

  9. Paper Reading - Convolutional Image Captioning ( CVPR 2018 )

    Link of the Paper: https://arxiv.org/abs/1711.09151 Motivation: LSTM units are complex and inherentl ...

随机推荐

  1. Xcode官方xip直接离线下载地址(更新到Xcode 9.4.1)

    Xcode 9.4.1 https://download.developer.apple.com/Developer_Tools/Xcode_9.4.1/Xcode_9.4.1.xip Xcode 9 ...

  2. Redis简单介绍与数据类型

    介绍 分布式缓存 NoSql:解决高并发.高可用.高可扩展,大数据存储等一系列问题而产生的数据库解决方案. Redis:键值(Key-Value)存储数据库 Redis是使用c语言开发的一个高性能键值 ...

  3. PICT工具一键生成正交试验用例

    PICT工具一键生成正交试验用例 作用: 1.解决手动设计大量测试用例.或覆盖不全面问题,提高测试效率 2.读取excel,将生成的参数组合自动带入脚本,进行接口自动化测试 一.PICT简介 PICT ...

  4. Linux Mint 使用 VNC Server (x11vnc) 进行远程屏幕

    https://community.linuxmint.com/tutorial/view/2334 This tutorial was adapted from here. Remove the d ...

  5. 柱体内温度分布图 MATLAB

    对于下底面和侧面绝热,上底面温度与半径平方成正比的柱体,绘制柱体内温度分布图. 这里给出两种尝试:1.散点图:2.切片云图 1. 散点图仿真 首先使用解析算法求的场解值的解析表达,其次求解Bessel ...

  6. 20155227 2016-2017-2 《Java程序设计》第六周学习总结

    20155227 2016-2017-2 <Java程序设计>第六周学习总结 教材学习内容总结 InputStream与OutputStream 串流设计 流(Stream)是对「输入输出 ...

  7. 20145226夏艺华 《Java程序设计》预备作业3

    安装虚拟机 上学期开学的时候就安装了Linux虚拟机,由于我的是Mac OS,所以和windows下的安装有所不同. 我使用的是VirtualBoxVM虚拟机,稳定性还不错,需要的同学可以从https ...

  8. Tomcat设置是否可以上传文件到服务器

    今天,我做的一个点菜项目要求做一个添加菜品,把菜品的路径保存进数据库,然后将菜品的图片保存进tomcat相应的目录中. 一开始,我在客户端写的代码是直接向tomcat的目录写文件,但是会出现403错误 ...

  9. java随机数的生成

    我们经常会用到随机数的生成,作为唯一性的id或者标识: long now = System.currentTimeMillis(); SimpleDateFormat dateFormat=new S ...

  10. 【LOJ4632】[PKUSC2018]真实排名

    [LOJ4632][PKUSC2018]真实排名 题面 终于有题面啦!!! 题目描述 小 C 是某知名比赛的组织者,该比赛一共有 \(n\) 名选手参加,每个选手的成绩是一个非负整数,定义一个选手的排 ...