CS231 Batch Normalization: Gradient Derivation and Code Reproduction (BN, LN)
Formulas from the original paper: https://arxiv.org/abs/1502.03167
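For reference, the training-time transform from Algorithm 1 of the paper, which the forward-pass code below implements (m is the minibatch size):

$$
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i,\qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2,\qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}},\qquad
y_i = \gamma\,\hat{x}_i + \beta
$$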
Batch Normalization computation graph:
Batch Normalization gradient derivation:
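The chain-rule partials, as given in Section 3 of the paper (applied per feature). The paper's expression for the gradient with respect to the mean also contains a term through the variance, but it vanishes because the centered inputs sum to zero, which is why the code below drops it:

$$
\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\,\gamma,\qquad
\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i-\mu_B)\cdot\Bigl(-\tfrac{1}{2}\Bigr)\,(\sigma_B^2+\varepsilon)^{-3/2}
$$

$$
\frac{\partial \ell}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma_B^2+\varepsilon}},\qquad
\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i,\qquad
\frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
$$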
Batch Normalization: final gradient with respect to x_i, combining the three paths:
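Summing the three paths x_i → x̂_i, x_i → σ²_B, and x_i → μ_B gives the expression that the backward code computes:

$$
\frac{\partial \ell}{\partial x_i}
= \frac{\partial \ell}{\partial \hat{x}_i}\cdot\frac{1}{\sqrt{\sigma_B^2+\varepsilon}}
+ \frac{\partial \ell}{\partial \sigma_B^2}\cdot\frac{2\,(x_i-\mu_B)}{m}
+ \frac{\partial \ell}{\partial \mu_B}\cdot\frac{1}{m}
$$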
Code reproduction of the paper's formulas:
import numpy as np


def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        #######################################################################
        # Training-time forward pass: use minibatch statistics to compute the #
        # mean and variance, normalize the incoming data, then scale and      #
        # shift it with gamma and beta. Intermediates needed by the backward  #
        # pass go into cache, and the running mean / variance are updated     #
        # with the momentum rule above.                                       #
        # Formulas: https://arxiv.org/abs/1502.03167 (Algorithm 1)            #
        #######################################################################
        mean_x = np.mean(x, axis=0)                  # per-feature mean, shape (D,)
        var_x = np.var(x, axis=0)                    # per-feature (biased) variance, shape (D,)
        x_hat = (x - mean_x) / np.sqrt(var_x + eps)  # normalize
        out = gamma * x_hat + beta                   # scale and shift

        running_mean = momentum * running_mean + (1 - momentum) * mean_x
        running_var = momentum * running_var + (1 - momentum) * var_x

        inv_var_x = 1 / np.sqrt(var_x + eps)         # inverse standard deviation
        cache = (x, x_hat, gamma, mean_x, inv_var_x)
    elif mode == 'test':
        # Test-time forward pass: normalize with the running mean / variance
        # accumulated during training, then scale and shift.
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
        out = gamma * x_hat + beta
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
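A minimal usage sketch for the forward pass (the shapes, seed, and number of steps here are made up for illustration): training mode accumulates running statistics inside bn_param, and test mode then reuses them:

import numpy as np

np.random.seed(0)
gamma, beta = np.ones(3), np.zeros(3)
bn_param = {'mode': 'train', 'momentum': 0.9}

# Train mode: normalize with minibatch statistics, update running averages.
for _ in range(50):
    out, cache = batchnorm_forward(np.random.randn(8, 3), gamma, beta, bn_param)

# Test mode: normalize with the accumulated running mean / variance.
bn_param['mode'] = 'test'
out_test, _ = batchnorm_forward(np.random.randn(8, 3), gamma, beta, bn_param)
print(out_test.shape)  # (8, 3)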
def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # Backward pass driven by the computation graph                           #
    #   x_i --> mu_B --> sigma^2_B --> x_hat_i --> y_i --> l                  #
    # where gamma and beta also feed directly into y_i.                       #
    # Reference: https://arxiv.org/abs/1502.03167                             #
    ###########################################################################
    x, x_hat, gamma, mean_x, inv_var_x = cache
    N = x.shape[0]

    # dx combines the three paths from the loss l back to x_i:
    # Path 1: l --> x_hat_i --> x_i
    dx = gamma * dout * inv_var_x
    # Path 2: l --> sigma^2_B --> x_i
    dx += (2 / N) * (x - mean_x) * np.sum(-(1 / 2) * inv_var_x ** 3 * (x - mean_x) * gamma * dout, axis=0)
    # Path 3: l --> mu_B --> x_i
    dx += (1 / N) * np.sum(-1 * inv_var_x * gamma * dout, axis=0)

    # dgamma: l --> y_i --> gamma
    dgamma = np.sum(x_hat * dout, axis=0)
    # dbeta: l --> y_i --> beta
    dbeta = np.sum(dout, axis=0)

    return dx, dgamma, dbeta
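One way to sanity-check the backward pass is a central-difference numerical gradient check. The sketch below assumes the two functions above are defined in the same module; num_grad is a small hypothetical helper written for this post (the cs231n assignment ships its own checker), and the printed error is only indicative:

import numpy as np

def num_grad(f, x, dout, h=1e-5):
    # Central-difference numerical gradient of sum(f(x) * dout) w.r.t. x.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        pos = f(x).copy()
        x[idx] = old - h
        neg = f(x).copy()
        x[idx] = old
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

np.random.seed(0)
x = np.random.randn(4, 5)
gamma, beta = np.random.randn(5), np.random.randn(5)
dout = np.random.randn(4, 5)

_, cache = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

f = lambda z: batchnorm_forward(z, gamma, beta, {'mode': 'train'})[0]
print(np.max(np.abs(dx - num_grad(f, x, dout))))  # expect an error on the order of 1e-8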
Code reproduction of batchnorm_backward_alt:
def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalization backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass.
    See the jupyter notebook for more hints.

    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # Simplified backward pass: after computing the gradient with respect to  #
    # the centered inputs, dx can be computed in a single statement.          #
    ###########################################################################
    x, x_hat, gamma, mean_x, inv_var_x = cache
    N = x.shape[0]
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(x_hat * dout, axis=0)
    dxhat = dout * gamma
    dx = (1. / N) * inv_var_x * (N * dxhat - np.sum(dxhat, axis=0) -
                                 x_hat * np.sum(dxhat * x_hat, axis=0))

    return dx, dgamma, dbeta
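Substituting the partials from the derivation above and simplifying collapses everything into the single closed-form expression that batchnorm_backward_alt computes, writing g_i for gamma times the upstream gradient (dxhat in the code):

$$
\frac{\partial \ell}{\partial x_i}
= \frac{1}{m\sqrt{\sigma_B^2+\varepsilon}}
\Bigl( m\,g_i - \sum_{j=1}^{m} g_j - \hat{x}_i \sum_{j=1}^{m} g_j\,\hat{x}_j \Bigr),
\qquad g_i = \gamma\,\frac{\partial \ell}{\partial y_i}
$$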
Layer Normalization:
def layernorm_forward(x, gamma, beta, ln_param):
    """
    Forward pass for layer normalization.

    During both training and test-time, the incoming data is normalized per data-point,
    before being scaled by gamma and beta parameters identical to that of batch normalization.

    Note that in contrast to batch normalization, the behavior during train and test-time for
    layer normalization are identical, and we do not need to keep track of running averages
    of any sort.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - ln_param: Dictionary with the following keys:
        - eps: Constant for numeric stability

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    out, cache = None, None
    eps = ln_param.get('eps', 1e-5)
    ###########################################################################
    # Layer norm forward pass: transpose the data so that the per-sample      #
    # statistics lie along axis 0, then reuse the batch norm computation      #
    # almost unchanged.                                                       #
    ###########################################################################
    x = x.T                                      # (N, D) --> (D, N)
    mean_x = np.mean(x, axis=0)                  # per-sample mean, shape (N,)
    var_x = np.var(x, axis=0)                    # per-sample variance, shape (N,)
    inv_var_x = 1 / np.sqrt(var_x + eps)

    x_hat = (x - mean_x) / np.sqrt(var_x + eps)  # (D, N)
    x_hat = x_hat.T                              # (D, N) --> (N, D)
    # gamma: (D,)   beta: (D,)
    out = gamma * x_hat + beta
    cache = (x_hat, gamma, mean_x, inv_var_x)

    return out, cache

def layernorm_backward(dout, cache):
    """
    Backward pass for layer normalization.

    For this implementation, you can heavily rely on the work you've done already
    for batch normalization.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from layernorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # Layer norm backward pass: reuse the simplified batch norm backward pass #
    # on the transposed tensors. Note that layernorm_forward caches four      #
    # values (no raw x), and that normalization runs over the D features.     #
    ###########################################################################
    x_hat, gamma, mean_x, inv_var_x = cache
    D = x_hat.shape[1]                       # number of features normalized per sample
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(x_hat * dout, axis=0)
    dxhat = dout * gamma
    dxhat = dxhat.T                          # (N, D) --> (D, N)
    x_hat = x_hat.T                          # (N, D) --> (D, N)
    dx = (1. / D) * inv_var_x * (D * dxhat - np.sum(dxhat, axis=0) -
                                 x_hat * np.sum(dxhat * x_hat, axis=0))
    dx = dx.T                                # (D, N) --> (N, D)

    return dx, dgamma, dbeta
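A quick sanity check for the layer-norm pair above (shapes, seed, and values invented for illustration): with gamma = 1 and beta = 0, every row of the output should have roughly zero mean and unit variance, and the gradients should come back with the expected shapes:

import numpy as np

np.random.seed(1)
N, D = 4, 6
x = np.random.randn(N, D) * 3 + 7           # arbitrary per-example scale and shift
gamma, beta = np.ones(D), np.zeros(D)

out, cache = layernorm_forward(x, gamma, beta, {'eps': 1e-5})
print(out.mean(axis=1))                     # each entry close to 0
print(out.var(axis=1))                      # each entry close to 1

dout = np.random.randn(N, D)
dx, dgamma, dbeta = layernorm_backward(dout, cache)
print(dx.shape, dgamma.shape, dbeta.shape)  # (4, 6) (6,) (6,)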