At training time:
1. The ground-truth target tokens are fed in and the whole sequence is decoded in a single pass.

At inference time:
1. In the first step, one token is fed in and one token is decoded.
In the second step, the first input token and the token just decoded are fed in together, and the third token is decoded. Decoding continues step by step like this until the maximum length is reached or a <pad> token is produced, at which point it stops.
Because the decoder is causally masked, feeding the whole target sequence at once during training is equivalent to feeding tokens one at a time during inference; see the decoding sketch below.
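
In code, the step-by-step inference above is typically a greedy decoding loop that re-runs the decoder on everything predicted so far. A minimal sketch (the placeholder names `model.x`, `model.decoder_inputs` and the prediction tensor `model.preds` are illustrative assumptions, not necessarily the repo's exact attributes):

    import numpy as np

    def greedy_decode(sess, model, x, maxlen, pad_id=0):
        """Decode one token at a time, feeding previous outputs back in."""
        preds = np.zeros((x.shape[0], maxlen), dtype=np.int32)  # start from an all-<pad> target
        for t in range(maxlen):
            # re-run the decoder on the source plus everything predicted so far
            _preds = sess.run(model.preds, {model.x: x, model.decoder_inputs: preds})
            preds[:, t] = _preds[:, t]           # keep only the newly decoded position
            if np.all(preds[:, t] == pad_id):    # every sequence emitted <pad>: stop early
                break
        return preds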

1. Token embeddings need to be passed in

    def __init__(self, hp):
        self.hp = hp
        self.token2idx, self.idx2token = load_vocab(hp.vocab)  # in a real application, pass in your own vocabulary here
        self.embeddings = get_token_embeddings(self.hp.vocab_size, self.hp.d_model, zero_pad=True)  # the author trains the token embeddings as a model variable; in production, pretrained embeddings such as word2vec or BERT could be used instead
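
`get_token_embeddings` builds the trainable lookup table mentioned in the comment. A rough sketch of what such a helper typically looks like in TF1 (an assumption for illustration, not the verbatim repo code):

    import tensorflow as tf

    def get_token_embeddings(vocab_size, num_units, zero_pad=True):
        """Trainable token-embedding matrix of shape (vocab_size, num_units)."""
        with tf.variable_scope("shared_weight_matrix"):
            embeddings = tf.get_variable("weight_mat",
                                         dtype=tf.float32,
                                         shape=(vocab_size, num_units),
                                         initializer=tf.contrib.layers.xavier_initializer())
            if zero_pad:
                # keep the <pad> row (id = 0) fixed at all zeros
                embeddings = tf.concat((tf.zeros(shape=[1, num_units]),
                                        embeddings[1:, :]), 0)
        return embeddings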

2. positional_encoding

    def positional_encoding(inputs,
                            num_units,
                            zero_pad=True,
                            scale=True,
                            scope="positional_encoding",
                            reuse=None):
        '''Sinusoidal Positional_Encoding.

        Args:
          inputs: A 2d Tensor with shape of (N, T).
          num_units: Output dimensionality
          zero_pad: Boolean. If True, all the values of the first row (id = 0) should be constant zero
          scale: Boolean. If True, the output will be multiplied by sqrt num_units (check details from paper)
          scope: Optional scope for `variable_scope`.
          reuse: Boolean, whether to reuse the weights of a previous layer
            by the same name.

        Returns:
          A 'Tensor' with one more rank than inputs's, with the dimensionality should be 'num_units'
        '''
        N, T = inputs.get_shape().as_list()
        with tf.variable_scope(scope, reuse=reuse):
            position_ind = tf.tile(tf.expand_dims(tf.range(T), 0), [N, 1])

            # First part of the PE function: sin and cos argument
            position_enc = np.array([
                [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
                for pos in range(T)])

            # Second part, apply the cosine to even columns and sin to odds.
            position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # dim 2i
            position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # dim 2i+1

            # Convert to a tensor
            lookup_table = tf.convert_to_tensor(position_enc)

            if zero_pad:
                lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                          lookup_table[1:, :]), 0)
            outputs = tf.nn.embedding_lookup(lookup_table, position_ind)

            if scale:
                outputs = outputs * num_units ** 0.5

            return outputs
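
The two `np.sin`/`np.cos` lines above fill in the sinusoidal table from the paper (sin on even dimensions, cos on odd ones). A standalone toy check of the table this code builds, with illustrative sizes:

    import numpy as np

    T, num_units = 5, 8  # toy sequence length and model dimension
    position_enc = np.array([
        [pos / np.power(10000, 2. * i / num_units) for i in range(num_units)]
        for pos in range(T)])
    position_enc[:, 0::2] = np.sin(position_enc[:, 0::2])  # even dims -> sin
    position_enc[:, 1::2] = np.cos(position_enc[:, 1::2])  # odd dims  -> cos
    print(position_enc.shape)  # (5, 8): one num_units-dimensional vector per position
    print(position_enc[0])     # position 0 comes out as [0, 1, 0, 1, ...]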

3. multihead_attention

    def multihead_attention(queries,
                            keys,
                            num_units=None,
                            num_heads=8,
                            dropout_rate=0,
                            is_training=True,
                            causality=False,
                            scope="multihead_attention",
                            reuse=None):
        '''Applies multihead attention.

        Args:
          queries: A 3d tensor with shape of [N, T_q, C_q].
          keys: A 3d tensor with shape of [N, T_k, C_k].
          num_units: A scalar. Attention size.
          dropout_rate: A floating point number.
          is_training: Boolean. Controller of mechanism for dropout.
          causality: Boolean. If true, units that reference the future are masked.
          num_heads: An int. Number of heads.
          scope: Optional scope for `variable_scope`.
          reuse: Boolean, whether to reuse the weights of a previous layer
            by the same name.

        Returns:
          A 3d tensor with shape of (N, T_q, C)
        '''
        with tf.variable_scope(scope, reuse=reuse):
            # Set the fall back option for num_units
            if num_units is None:
                num_units = queries.get_shape().as_list()[-1]

            # Linear projections
            Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu)  # (N, T_q, C); C is num_units, which falls back to C_q when left unset
            K = tf.layers.dense(keys, num_units, activation=tf.nn.relu)  # (N, T_k, C)
            V = tf.layers.dense(keys, num_units, activation=tf.nn.relu)  # (N, T_k, C)

            # Split and concat
            Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
            K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
            V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)

            # Multiplication
            outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)

            # Scale
            outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)

            # Key Masking
            key_masks = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1))  # (N, T_k)
            key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
            key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)

            # paddings has the same shape as outputs and is filled with a very large negative value.
            # tf.where then compares: wherever key_masks is 0 (the key is padding and must be masked),
            # the attention score in outputs is replaced by that very negative value; otherwise the
            # original score is kept. After this key-mask step outputs still has shape (h*N, T_q, T_k);
            # only the scores of the masked keys have been pushed down to a tiny value.
            paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

            # Causality = Future blinding
            if causality:  # whether to mask out future positions
                diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
                tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (T_q, T_k)
                masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)

                paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
                outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

            # Activation
            outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)

            # Query Masking
            query_masks = tf.sign(tf.reduce_sum(tf.abs(queries), axis=-1))  # (N, T_q)
            query_masks = tf.tile(query_masks, [num_heads, 1])  # (h*N, T_q)
            query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (h*N, T_q, T_k)
            outputs *= query_masks  # broadcasting. (h*N, T_q, T_k); the original comment's C should read T_k

            # Dropouts
            outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

            # Weighted sum
            outputs = tf.matmul(outputs, V_)  # (h*N, T_q, C/h)

            # Restore shape
            outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)

            # Residual connection
            outputs += queries

            # Normalize
            outputs = normalize(outputs)  # (N, T_q, C)

            return outputs
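
To see what the key masking above does to the attention weights, here is a toy single-head NumPy version of the same scaled dot-product with a padding mask (an illustrative sketch, not the repo's code):

    import numpy as np

    def toy_attention(Q, K, V, key_pad_mask):
        """Q: (T_q, d); K, V: (T_k, d); key_pad_mask: (T_k,) with 1 = real token, 0 = <pad>."""
        scores = Q @ K.T / np.sqrt(K.shape[-1])                # (T_q, T_k), scaled dot product
        scores = np.where(key_pad_mask[None, :] == 0,
                          -2.0 ** 32 + 1, scores)              # padded keys get a huge negative score
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights = weights / weights.sum(-1, keepdims=True)     # softmax over the key axis
        return weights @ V                                     # (T_q, d)

    np.random.seed(0)
    Q, K, V = np.random.randn(3, 4), np.random.randn(5, 4), np.random.randn(5, 4)
    out = toy_attention(Q, K, V, key_pad_mask=np.array([1, 1, 1, 0, 0]))
    print(out.shape)  # (3, 4); the two padded keys receive ~zero attention weight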

4. feedforward

    def feedforward(inputs,
                    num_units=[2048, 512],
                    scope="multihead_attention",
                    reuse=None):
        '''Point-wise feed forward net.

        Args:
          inputs: A 3d tensor with shape of [N, T, C].
          num_units: A list of two integers.
          scope: Optional scope for `variable_scope`.
          reuse: Boolean, whether to reuse the weights of a previous layer
            by the same name.

        Returns:
          A 3d tensor with the same shape and dtype as inputs
        '''
        with tf.variable_scope(scope, reuse=reuse):
            # Inner layer
            params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                      "activation": tf.nn.relu, "use_bias": True}
            outputs = tf.layers.conv1d(**params)

            # Readout layer
            params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                      "activation": None, "use_bias": True}
            outputs = tf.layers.conv1d(**params)

            # Residual connection
            outputs += inputs

            # Normalize
            outputs = normalize(outputs)

            return outputs
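
Note that `conv1d` with `kernel_size=1` is simply a position-wise dense layer applied independently at every time step, which is exactly the feed-forward sublayer of the paper. A small shape check under assumed sizes (the two layers have different random weights, so only the shapes match):

    import tensorflow as tf

    x = tf.random_normal([2, 7, 512])  # (N, T, C)
    conv_out = tf.layers.conv1d(x, filters=2048, kernel_size=1, activation=tf.nn.relu)
    dense_out = tf.layers.dense(x, 2048, activation=tf.nn.relu)
    print(conv_out.shape, dense_out.shape)  # both (2, 7, 2048): one weight matrix shared across all time steps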

5. normalize

    def normalize(inputs,
                  epsilon=1e-8,
                  scope="ln",
                  reuse=None):
        '''Applies layer normalization.

        Args:
          inputs: A tensor with 2 or more dimensions, where the first dimension has
            `batch_size`.
          epsilon: A floating number. A very small number for preventing ZeroDivision Error.
          scope: Optional scope for `variable_scope`.
          reuse: Boolean, whether to reuse the weights of a previous layer
            by the same name.

        Returns:
          A tensor with the same shape and data dtype as `inputs`.
        '''
        with tf.variable_scope(scope, reuse=reuse):
            inputs_shape = inputs.get_shape()
            params_shape = inputs_shape[-1:]

            mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
            beta = tf.Variable(tf.zeros(params_shape))
            gamma = tf.Variable(tf.ones(params_shape))
            normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
            outputs = gamma * normalized + beta

            return outputs
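
For intuition, the same per-position layer normalization in plain NumPy (a sanity-check sketch with gamma = 1 and beta = 0):

    import numpy as np

    def layer_norm_np(x, epsilon=1e-8):
        # one mean/variance per position, computed over the last (feature) axis
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + epsilon)

    x = np.random.randn(2, 3, 8)
    y = layer_norm_np(x)
    print(y.mean(-1).round(6))  # ~0 for every position
    print(y.std(-1).round(3))   # ~1 for every position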

6. encoder-decoder

  1. with tf.variable_scope("encoder"):
  2. ## Embedding
  3. self.enc = embedding(self.x,
  4. vocab_size=len(de2idx),
  5. num_units=hp.hidden_units,
  6. scale=True,
  7. scope="enc_embed")
  8.  
  9. # key_masks = tf.expand_dims(tf.sign(tf.reduce_sum(tf.abs(self.enc), axis=-1)), -1)
  10.  
  11. ## Positional Encoding
  12. if hp.sinusoid:
  13. self.enc += tf.cast(positional_encoding(self.x,
  14. num_units=hp.hidden_units,
  15. zero_pad=False,
  16. scale=False,
  17. scope="enc_pe"), tf.float32)
  18. else:
  19. self.enc += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.x)[1]), 0), [tf.shape(self.x)[0], 1]),
  20. vocab_size=hp.maxlen,
  21. num_units=hp.hidden_units,
  22. zero_pad=False,
  23. scale=False,
  24. scope="enc_pe")
  25.  
  26. # self.enc *= key_masks
  27.  
  28. ## Dropout
  29. self.enc = tf.layers.dropout(self.enc,
  30. rate=hp.dropout_rate,
  31. training=tf.convert_to_tensor(is_training))
  32.  
  33. ## Blocks
  34. for i in range(hp.num_blocks):
  35. with tf.variable_scope("num_blocks_{}".format(i)):
  36. ### Multihead Attention
  37. self.enc = multihead_attention(queries=self.enc,
  38. keys=self.enc,
  39. num_units=hp.hidden_units,
  40. num_heads=hp.num_heads,
  41. dropout_rate=hp.dropout_rate,
  42. is_training=is_training,
  43. causality=False)
  44.  
  45. ### Feed Forward
  46. self.enc = feedforward(self.enc, num_units=[4*hp.hidden_units, hp.hidden_units])
  47.  
  48. # Decoder
  49. with tf.variable_scope("decoder"):
  50. ## Embedding
  51. self.dec = embedding(self.decoder_inputs,
  52. vocab_size=len(en2idx),
  53. num_units=hp.hidden_units,
  54. scale=True,
  55. scope="dec_embed")
  56. self.dec_ = self.dec
  57.  
  58. # key_masks = tf.expand_dims(tf.sign(tf.reduce_sum(tf.abs(self.dec), axis=-1)), -1)
  59.  
  60. ## Positional Encoding
  61. if hp.sinusoid:
  62. self.dec += tf.cast(positional_encoding(self.decoder_inputs,
  63. num_units=hp.hidden_units,
  64. zero_pad=False,
  65. scale=False,
  66. scope="dec_pe"), tf.float32)
  67. else:
  68. self.dec += embedding(tf.tile(tf.expand_dims(tf.range(tf.shape(self.decoder_inputs)[1]), 0), [tf.shape(self.decoder_inputs)[0], 1]),
  69. vocab_size=hp.maxlen,
  70. num_units=hp.hidden_units,
  71. zero_pad=False,
  72. scale=False,
  73. scope="dec_pe")
  74. # self.dec *= key_masks
  75.  
  76. ## Dropout
  77. self.dec = tf.layers.dropout(self.dec,
  78. rate=hp.dropout_rate,
  79. training=tf.convert_to_tensor(is_training))
  80.  
  81. ## Blocks
  82. for i in range(hp.num_blocks):
  83. with tf.variable_scope("num_blocks_{}".format(i)):
  84. ## Multihead Attention ( self-attention)
  85. self.dec = multihead_attention(queries=self.dec,
  86. keys=self.dec,
  87. num_units=hp.hidden_units,
  88. num_heads=hp.num_heads,
  89. dropout_rate=hp.dropout_rate,
  90. is_training=is_training,
  91. causality=True,
  92. scope="self_attention")
  93.  
  94. ## Multihead Attention ( vanilla attention)
  95. self.dec = multihead_attention(queries=self.dec,
  96. keys=self.enc,
  97. num_units=hp.hidden_units,
  98. num_heads=hp.num_heads,
  99. dropout_rate=hp.dropout_rate,
  100. is_training=is_training,
  101. causality=False,
  102. scope="vanilla_attention")
  103. ## Feed Forward
  104. self.dec = feedforward(self.dec, num_units=[4*hp.hidden_units, hp.hidden_units])
  1. # Final linear projection
    self.logits = tf.layers.dense(self.dec, len(en2idx))
    self.preds = tf.to_int32(tf.arg_max(self.logits, dimension=-1))
    self.istarget = tf.to_float(tf.not_equal(self.y, 0))
    self.acc = tf.reduce_sum(tf.to_float(tf.equal(self.preds, self.y))*self.istarget) / (tf.reduce_sum(self.istarget))
    tf.summary.scalar('acc', self.acc)
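
`istarget` marks the non-`<pad>` positions (id 0 is padding), so the accuracy is averaged only over real target tokens. A toy NumPy version of the same computation:

    import numpy as np

    y     = np.array([[5, 7, 2, 0, 0]])  # target ids, 0 = <pad>
    preds = np.array([[5, 1, 2, 0, 0]])  # predictions
    istarget = (y != 0).astype(np.float32)
    acc = ((preds == y).astype(np.float32) * istarget).sum() / istarget.sum()
    print(acc)  # 2 correct out of 3 real tokens -> 0.666...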

7. train

    if is_training:
        # Loss
        self.y_smoothed = label_smoothing(tf.one_hot(self.y, depth=len(en2idx)))
        self.loss = tf.nn.softmax_cross_entropy_with_logits(logits=self.logits, labels=self.y_smoothed)
        self.mean_loss = tf.reduce_sum(self.loss * self.istarget) / (tf.reduce_sum(self.istarget))

        # Training Scheme
        self.global_step = tf.Variable(0, name='global_step', trainable=False)
        self.optimizer = tf.train.AdamOptimizer(learning_rate=hp.lr, beta1=0.9, beta2=0.98, epsilon=1e-8)
        self.train_op = self.optimizer.minimize(self.mean_loss, global_step=self.global_step)

        # Summary
        tf.summary.scalar('mean_loss', self.mean_loss)
        self.merged = tf.summary.merge_all()
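
`label_smoothing` is defined elsewhere in the repo; the usual formulation spreads a small probability mass epsilon uniformly over all classes. A sketch of that standard form (epsilon = 0.1 is an assumed default):

    import tensorflow as tf

    def label_smoothing(inputs, epsilon=0.1):
        """inputs: one-hot targets of shape (N, T, V); returns smoothed targets."""
        V = inputs.get_shape().as_list()[-1]  # number of classes (vocabulary size)
        return ((1 - epsilon) * inputs) + (epsilon / V)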
