TensorFlow使用记录 (八): 梯度修剪 和 Max-Norm Regularization
梯度修剪
梯度修剪主要避免训练梯度爆炸的问题,一般来说使用了 Batch Normalization 就不必要使用梯度修剪了,但还是有必要理解下实现的
In TensorFlow, the optimizer’s minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer’s compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer’s apply_gradients() method:
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)
例子:
import tensorflow as tf def Swish(features):
return features*tf.nn.sigmoid(features) # 1. create data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('../MNIST_data', one_hot=True) X = tf.placeholder(tf.float32, shape=(None, 784), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
is_training = tf.placeholder(tf.bool, None, name='is_training') # 2. define network
he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.name_scope('dnn'):
hidden1 = tf.layers.dense(X, 300, kernel_initializer=he_init, name='hidden1')
# hidden1 = tf.layers.batch_normalization(hidden1, momentum=0.9)
hidden1 = tf.nn.relu(hidden1)
hidden2 = tf.layers.dense(hidden1, 100, kernel_initializer=he_init, name='hidden2')
# hidden2 = tf.layers.batch_normalization(hidden2, training=is_training, momentum=0.9)
hidden2 = tf.nn.relu(hidden2)
logits = tf.layers.dense(hidden2, 10, kernel_initializer=he_init, name='output')
# prob = tf.layers.dense(hidden2, 10, tf.nn.softmax, name='prob') # 3. define loss
with tf.name_scope('loss'):
# tf.losses.sparse_softmax_cross_entropy() label is not one_hot and dtype is int*
# xentropy = tf.losses.sparse_softmax_cross_entropy(labels=tf.argmax(y, axis=1), logits=logits)
# tf.nn.sparse_softmax_cross_entropy_with_logits() label is not one_hot and dtype is int*
# xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=tf.argmax(y, axis=1), logits=logits)
# loss = tf.reduce_mean(xentropy)
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits) # label is one_hot # 4. define optimizer
learning_rate = 0.01
with tf.name_scope('train'):
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # for batch normalization
with tf.control_dependencies(update_ops):
# optimizer_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
optimizer_op = optimizer.apply_gradients(capped_gvs) with tf.name_scope('eval'):
correct = tf.nn.in_top_k(logits, tf.argmax(y, axis=1), 1) # 目标是否在前K个预测中, label's dtype is int*
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) # 5. initialize
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
saver = tf.train.Saver()
# =================
print([v.name for v in tf.trainable_variables()])
print([v.name for v in tf.global_variables()])
# =================
# 5. train & test
n_epochs = 20
n_batches = 50
batch_size = 50 with tf.Session() as sess:
sess.run(init_op)
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run(optimizer_op, feed_dict={X: X_batch, y: y_batch, is_training:True})
# =================
# for grad, var in grads_and_vars:
# grad = grad.eval(feed_dict={X: X_batch, y: y_batch, is_training:True})
# var = var.eval()
# =================
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch, is_training:False}) # 最后一个 batch 的 accuracy
acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
loss_test = loss.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test, "Test loss:", loss_test)
save_path = saver.save(sess, "./my_model_final.ckpt") with tf.Session() as sess:
sess.run(init_op)
saver.restore(sess, "./my_model_final.ckpt")
acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
loss_test = loss.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
print("Test accuracy:", acc_test, ", Test loss:", loss_test)
下面我们来看看上面这个例子里所涉及的一些东西
compute_gradients
compute_gradients 是任何一个优化器都有的方法:
compute_gradients(
loss,
var_list=None,
gate_gradients=GATE_OP,
aggregation_method=None,
colocate_gradients_with_ops=False,
grad_loss=None
)
计算 loss 中可训练的 var_list 中的梯度。
相当于minimize() 的第一步,返回 (gradient, variable) 列表。
获得了梯度后我们就可以手动进行梯度裁剪了,下面这句话就是将梯度限制到 [-threshold, threshold] 的范围内:
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
for grad, var in grads_and_vars]
apply_gradients
apply_gradients 同样是任何一个优化器都有的方法:
apply_gradients(
grads_and_vars,
global_step=None,
name=None
)
minimize() 的第二部分,返回一个执行梯度更新的 ops。
Max-Norm Regularization
对于每个节点,max-norm regularization 会对权重 $\mathbf{w}$ 进行限制 $\lVert \mathbf{w} \rVert_2 \le r$:
\begin{equation}
\label{a}
\mathbf{w} \gets \mathbf{w} \frac{r}{\lVert \mathbf{w} \rVert_2}
\end{equation}
实例代码:
import tensorflow as tf # =================
def max_norm_regularizer(threshold=1.0, axes=1, name="max_norm",
collection="max_norm"):
def max_norm(weights):
clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
clip_weights = tf.assign(weights, clipped, name=name)
tf.add_to_collection(collection, clip_weights)
return None # there is no regularization loss term
return max_norm
max_norm_reg = max_norm_regularizer(threshold=1.0)
# ================= # 1. create data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('../MNIST_data', one_hot=True) X = tf.placeholder(tf.float32, shape=(None, 784), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
is_training = tf.placeholder(tf.bool, None, name='is_training') # 2. define network
he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.name_scope('dnn'):
hidden1 = tf.layers.dense(X, 300, kernel_initializer=he_init,
kernel_regularizer=max_norm_reg, name='hidden1')
# hidden1 = tf.layers.batch_normalization(hidden1, momentum=0.9)
hidden1 = tf.nn.relu(hidden1)
hidden2 = tf.layers.dense(hidden1, 100, kernel_initializer=he_init,
kernel_regularizer=max_norm_reg, name='hidden2')
# hidden2 = tf.layers.batch_normalization(hidden2, training=is_training, momentum=0.9)
hidden2 = tf.nn.relu(hidden2)
logits = tf.layers.dense(hidden2, 10, kernel_initializer=he_init, name='output') # 3. define loss
with tf.name_scope('loss'):
loss = tf.losses.softmax_cross_entropy(onehot_labels=y, logits=logits) # label is one_hot # 4. define optimizer
learning_rate_init = 0.01
global_step = tf.Variable(0, trainable=False)
with tf.name_scope('train'):
learning_rate = tf.train.polynomial_decay( # 多项式衰减
learning_rate=learning_rate_init, # 初始学习率
global_step=global_step, # 当前迭代次数
decay_steps=22000, # 在迭代到该次数实际,学习率衰减为 learning_rate * dacay_rate
end_learning_rate=learning_rate_init / 10, # 最小的学习率
power=0.9,
cycle=False
)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) # for batch normalization
with tf.control_dependencies(update_ops):
optimizer_op = tf.train.MomentumOptimizer(
learning_rate=learning_rate, momentum=0.9).minimize(
loss=loss,
var_list=tf.trainable_variables(),
global_step=global_step # 不指定的话学习率不更新
)
# ================= clip gradient
# threshold = 1.0
# optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# grads_and_vars = optimizer.compute_gradients(loss)
# capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
# for grad, var in grads_and_vars]
# optimizer_op = optimizer.apply_gradients(capped_gvs)
# ================= with tf.name_scope('eval'):
correct = tf.nn.in_top_k(logits, tf.argmax(y, axis=1), 1) # 目标是否在前K个预测中, label's dtype is int*
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32)) # 5. initialize
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
saver = tf.train.Saver() # =================
clip_all_weights = tf.get_collection("max_norm")
# ================= # 6. train & test
n_epochs = 20
batch_size = 50 with tf.Session() as sess:
sess.run(init_op)
# saver.restore(sess, './my_model_final.ckpt')
for epoch in range(n_epochs):
for iteration in range(mnist.train.num_examples // batch_size):
X_batch, y_batch = mnist.train.next_batch(batch_size)
sess.run([optimizer_op, learning_rate], feed_dict={X: X_batch, y: y_batch, is_training:True})
sess.run(clip_all_weights)
# ================= check gradient
# for grad, var in grads_and_vars:
# grad = grad.eval(feed_dict={X: X_batch, y: y_batch, is_training:True})
# var = var.eval()
# =================
learning_rate_cur = learning_rate.eval()
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch, is_training:False}) # 最后一个 batch 的 accuracy
acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
loss_test = loss.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels, is_training:False})
print(epoch, "Current learning rate:", learning_rate_cur, "Train accuracy:", acc_train, "Test accuracy:", acc_test, "Test loss:", loss_test)
save_path = saver.save(sess, "./my_model_final.ckpt")
TensorFlow使用记录 (八): 梯度修剪 和 Max-Norm Regularization的更多相关文章
- TensorFlow使用记录 (六): 优化器
0. tf.train.Optimizer tensorflow 里提供了丰富的优化器,这些优化器都继承与 Optimizer 这个类.class Optimizer 有一些方法,这里简单介绍下: 0 ...
- TensorFlow 学习(八)—— 梯度计算(gradient computation)
maxpooling 的 max 函数关于某变量的偏导也是分段的,关于它就是 1,不关于它就是 0: BP 是反向传播求关于参数的偏导,SGD 则是梯度更新,是优化算法: 1. 一个实例 relu = ...
- 『PyTorch x TensorFlow』第八弹_基本nn.Module层函数
『TensorFlow』网络操作API_上 『TensorFlow』网络操作API_中 『TensorFlow』网络操作API_下 之前也说过,tf 和 t 的层本质区别就是 tf 的是层函数,调用即 ...
- Tensorflow安装记录
一.安装Ubantu环境 下载ios 网址:http://cn.ubuntu.com/download/ 2.配合虚拟机进行安装环境 虚拟机直接百度下载即可 虚拟机采用 具体安装,虚拟机百度中很多记录 ...
- linux 配置tensorflow 全过程记录
前几天刚下一个deepin系统,是基于linux 内核的,界面的设计有些mac的feel 感觉还是挺不错的,之后就赶紧配置了一下tensorflow ,尽管之前配置过,但是这次还是遇到点儿问题,所以说 ...
- TensorFlow使用记录 (七): BN 层及 Dropout 层的使用
参考:tensorflow中的batch_norm以及tf.control_dependencies和tf.GraphKeys.UPDATE_OPS的探究 1. Batch Normalization ...
- TensorFlow使用记录 (五): 激活函数和初始化方式
In general ELU > leaky ReLU(and its variants) > ReLU > tanh > logistic. If you care a lo ...
- TensorFlow实战第八课(卷积神经网络CNN)
首先我们来简单的了解一下什么是卷积神经网路(Convolutional Neural Network) 卷积神经网络是近些年逐步兴起的一种人工神经网络结构, 因为利用卷积神经网络在图像和语音识别方面能 ...
- TensorFlow学习记录(一)
windows下的安装: 首先访问https://storage.googleapis.com/tensorflow/ 找到对应操作系统下,对应python版本,对应python位数的whl,下载. ...
随机推荐
- LeetCode面试常见100题( TOP 100 Liked Questions)
LeetCode面试常见100题( TOP 100 Liked Questions) 置顶 2018年07月16日 11:25:22 lanyu_01 阅读数 9704更多 分类专栏: 面试编程题真题 ...
- Java 多线程创建和线程状态
一.进程和线程 多任务操作系统中,每个运行的任务是操作系统运行的独立程序. 为什么引进进程的概念? 为了使得程序能并发执行,并对并发执行的程序加以描述和控制. 因为通常的程序不能并发执行,为使程序(含 ...
- Java反射理解(五)-- 方法反射的基本操作
Java反射理解(五)-- 方法反射的基本操作 方法的反射 1. 如何获取某个方法 方法的名称和方法的参数列表才能唯一决定某个方法 2. 方法反射的操作 method.invoke(对象,参数列表) ...
- C++入门基础知识(一)
一:关键字 在C语言中,我们已经学习过了很多的关键字,例如:static,struct等,下面展现一下C++中的一些关键字. 二:命名空间 在C/C++中,变量.函数和类都是大量存在的,这些变量.函数 ...
- Java函数式接口
函数式接口定义且只定义了一个抽象方法.函数式接口的抽象方法的签名称为函数描述符.Java 8的java.util.function包中引入了几个新的函数式接口. 1.Predicate java.ut ...
- C#实现鼠标滚筒缩放界面的效果
elementCanvas继承UserControl 声明属性: #region 缩放属性添加 float ratio = 1.0f; public float Ratio { set { ratio ...
- C3.js入门案例
C3.js是基于D3.js开发的JavaScript库,它可以让开发者构建出可复用的图表,并且还提供了一系列图表上的交互行为.通过C3,只需要往generate函数中传入数据对象就可以轻松的绘制出图表 ...
- 15条SQLite3语句
15条SQLite3语句 转自:http://www.thegeekstuff.com/2012/09/sqlite-command-examples/ SQLite3 is very light ...
- vue 编辑
点击文字修改 <div class="baseInfo"> <p class="title">基本信息</p> <p ...
- 一点css 基础
css 行内样式优先度最高 margin 属性 为声明外边距 如图 顺序依次为上右下左