Optimization Algorithms
1. Mini-batch gradient descent
Batch gradient descent: each iteration processes the entire training set at once.
Mini-batch gradient descent: each iteration processes a single mini-batch (X^{t}, Y^{t}).
Choosing your mini-batch size: if the training set is small (m < 2000), just use batch gradient descent; otherwise pick a mini-batch size between 64 and 512 (a power of 2). It usually takes several trials to find a mini-batch size that works well.
A variant of this is Stochastic Gradient Descent (SGD), which is equivalent to mini-batch gradient descent where each mini-batch has just 1 example. The update rule that you have just implemented does not change. What changes is that you would be computing gradients on just one training example at a time, rather than on the whole training set. The code examples below illustrate the difference between stochastic gradient descent and (batch) gradient descent.
- (Batch) Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation over the whole training set
    a, caches = forward_propagation(X, parameters)
    # Compute cost
    cost = compute_cost(a, Y)
    # Backward propagation
    grads = backward_propagation(a, caches, parameters)
    # Update parameters
    parameters = update_parameters(parameters, grads)
- Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation on a single training example
        a, caches = forward_propagation(X[:, j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:, j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters
        parameters = update_parameters(parameters, grads)
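The generic update_parameters step in the loops above (and the update_parameters_with_gd used in the final exercise below) is just a plain gradient step. The notes do not write it out; here is a minimal sketch, assuming the same W/b parameter-dictionary convention used in the rest of the assignment code:

def update_parameters_with_gd(parameters, grads, learning_rate):
    """Plain gradient-descent step: W = W - learning_rate * dW, b = b - learning_rate * db for every layer."""
    L = len(parameters) // 2    # number of layers in the neural network
    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
    return parameters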
import math
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)    # to make the "random" minibatches reproducible
    m = X.shape[1]          # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y) with the same permutation so examples and labels stay paired
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y), minus the end case
    num_complete_minibatches = math.floor(m / mini_batch_size)    # number of full mini-batches of size mini_batch_size
    for k in range(0, num_complete_minibatches):
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    # Handle the end case (last mini-batch smaller than mini_batch_size)
    if m % mini_batch_size != 0:
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batches.append((mini_batch_X, mini_batch_Y))

    return mini_batches
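A quick sanity check of the partitioning (the toy shapes below are made up for illustration, not from the notes):

import numpy as np

np.random.seed(1)
X_toy = np.random.randn(12, 148)                      # 12 features, 148 examples
Y_toy = (np.random.rand(1, 148) < 0.5).astype(int)    # binary labels

mini_batches = random_mini_batches(X_toy, Y_toy, mini_batch_size=64, seed=0)
print(len(mini_batches))            # 3 mini-batches: 64 + 64 + 20 examples
print(mini_batches[0][0].shape)     # (12, 64)
print(mini_batches[-1][0].shape)    # (12, 20) -- the smaller end batch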
2. Exponentially weighted averages
The exponentially weighted average is computed as V_t = β·V_{t-1} + (1 − β)·θ_t. V_t can be read as an average over roughly the last 1/(1 − β) values (e.g. days of temperature): with β = 0.9 it is approximately a 10-day average. The larger β is, the more slowly the average adapts to new data.
Bias correction: because V_0 = 0, the first few estimates are biased toward zero; dividing by (1 − β^t), i.e. using V_t / (1 − β^t), corrects this during the early iterations.
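As a small illustration (a sketch, not part of the original notes) of the running average and its bias correction:

import numpy as np

def exp_weighted_average(theta, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a 1-D sequence theta (e.g. daily temperatures)."""
    v = 0.0
    averages = []
    for t, theta_t in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * theta_t                       # V_t = beta * V_{t-1} + (1 - beta) * theta_t
        v_hat = v / (1 - beta ** t) if bias_correction else v     # divide by (1 - beta^t) to fix the early bias
        averages.append(v_hat)
    return np.array(averages)

# beta = 0.9 roughly averages over the last 1/(1 - beta) = 10 values.
temps = np.array([40, 49, 45, 44, 50, 52, 48, 47, 51, 53], dtype=float)
print(exp_weighted_average(temps, beta=0.9))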
3. Gradient descent with Momentum
Momentum computes an exponentially weighted average of the gradients (the "velocity") and uses that average, rather than the raw mini-batch gradient, to update the parameters, which smooths out the oscillations of mini-batch gradient descent.
def initialize_velocity(parameters):
    """
    Initializes the velocity as a python dictionary with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl

    Returns:
    v -- python dictionary containing the current velocity.
                    v['dW' + str(l)] = velocity of dWl
                    v['db' + str(l)] = velocity of dbl
    """
    L = len(parameters) // 2    # number of layers in the neural network
    v = {}

    # Initialize every velocity to zeros with the same shape as the matching parameter
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)

    return v
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2    # number of layers in the neural network

    # Momentum update for each parameter
    for l in range(L):
        # compute velocities: v = beta * v + (1 - beta) * grad
        v["dW" + str(l + 1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads["db" + str(l + 1)]
        # update parameters with the velocity instead of the raw gradient
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v["db" + str(l + 1)]

    return parameters, v
Note: β = 0.9 is often a reasonable default for the momentum hyperparameter.
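A hypothetical one-layer usage of the two functions above (the shapes and values are made up for illustration):

import numpy as np

np.random.seed(2)
parameters = {"W1": np.random.randn(3, 2), "b1": np.zeros((3, 1))}
grads = {"dW1": np.random.randn(3, 2), "db1": np.random.randn(3, 1)}

v = initialize_velocity(parameters)    # velocities start at zero
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta=0.9, learning_rate=0.01)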
4. RMSprop (root mean square prop)
RMSprop keeps an exponentially weighted average of the squared gradients, S_dW = β·S_dW + (1 − β)·dW², and divides each update by its square root: W = W − α·dW / (√S_dW + ε). This damps oscillations in the directions with large gradients and allows a somewhat larger learning rate.
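The notes do not include an implementation for this section; below is a minimal sketch of the RMSprop update, following the same dictionary convention as the Momentum code above (the function name and the default hyperparameters here are assumptions, not from the assignment):

import numpy as np

def update_parameters_with_rmsprop(parameters, grads, s, beta=0.999, learning_rate=0.01, epsilon=1e-8):
    """RMSprop step: keep a running average of the squared gradients and scale each update by its square root."""
    L = len(parameters) // 2    # number of layers in the neural network
    for l in range(L):
        # S_dW = beta * S_dW + (1 - beta) * dW^2 (element-wise square)
        s["dW" + str(l + 1)] = beta * s["dW" + str(l + 1)] + (1 - beta) * np.square(grads["dW" + str(l + 1)])
        s["db" + str(l + 1)] = beta * s["db" + str(l + 1)] + (1 - beta) * np.square(grads["db" + str(l + 1)])
        # W = W - learning_rate * dW / (sqrt(S_dW) + epsilon)
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)] / (np.sqrt(s["dW" + str(l + 1)]) + epsilon)
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)] / (np.sqrt(s["db" + str(l + 1)]) + epsilon)
    return parameters, s

# s can be initialized to zeros with the same shapes as the gradients, e.g. by reusing initialize_velocity(parameters).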
5. Adam optimization algorithm
Adam essentially combines Momentum and RMSprop: it keeps an exponentially weighted average of the gradients (first moment, decay rate β1) and of the squared gradients (second moment, decay rate β2), corrects both for bias, and then updates W = W − α·v_corrected / (√s_corrected + ε).
def initialize_adam(parameters):
    """
    Initializes v and s as two python dictionaries with:
        - keys: "dW1", "db1", ..., "dWL", "dbL"
        - values: numpy arrays of zeros of the same shape as the corresponding gradients/parameters.

    Arguments:
    parameters -- python dictionary containing your parameters.
                    parameters["W" + str(l)] = Wl
                    parameters["b" + str(l)] = bl

    Returns:
    v -- python dictionary that will contain the exponentially weighted average of the gradient.
                    v["dW" + str(l)] = ...
                    v["db" + str(l)] = ...
    s -- python dictionary that will contain the exponentially weighted average of the squared gradient.
                    s["dW" + str(l)] = ...
                    s["db" + str(l)] = ...
    """
    L = len(parameters) // 2    # number of layers in the neural network
    v = {}
    s = {}

    # Initialize v, s to zeros with the same shapes as the parameters
    for l in range(L):
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
        s["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        s["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)

    return v, s
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameter:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam counter, used for bias correction
    learning_rate -- the learning rate, scalar
    beta1 -- exponential decay hyperparameter for the first moment estimates
    beta2 -- exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2    # number of layers in the neural network
    v_corrected = {}            # bias-corrected first moment estimate
    s_corrected = {}            # bias-corrected second moment estimate

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients (first moment)
        v["dW" + str(l + 1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads["dW" + str(l + 1)]
        v["db" + str(l + 1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads["db" + str(l + 1)]

        # Compute bias-corrected first moment estimate
        v_corrected["dW" + str(l + 1)] = v["dW" + str(l + 1)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l + 1)] = v["db" + str(l + 1)] / (1 - np.power(beta1, t))

        # Moving average of the squared gradients (second moment)
        s["dW" + str(l + 1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads["dW" + str(l + 1)], 2)
        s["db" + str(l + 1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads["db" + str(l + 1)], 2)

        # Compute bias-corrected second raw moment estimate
        s_corrected["dW" + str(l + 1)] = s["dW" + str(l + 1)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l + 1)] = s["db" + str(l + 1)] / (1 - np.power(beta2, t))

        # Update parameters: W = W - learning_rate * v_corrected / (sqrt(s_corrected) + epsilon)
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * v_corrected["dW" + str(l + 1)] / (np.sqrt(s_corrected["dW" + str(l + 1)]) + epsilon)
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * v_corrected["db" + str(l + 1)] / (np.sqrt(s_corrected["db" + str(l + 1)]) + epsilon)

    return parameters, v, s
6. Learning rate decay
One way to speed up learning is to gradually reduce the learning rate over time: early in training you can afford larger steps, and as training starts to converge, a smaller learning rate keeps the steps small. A common schedule is α = α₀ / (1 + decay_rate · epoch_num); exponential decay, e.g. α = 0.95^epoch_num · α₀, is another option (see the sketch below).
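A small sketch of the α₀ / (1 + decay_rate · epoch_num) schedule (the helper name and the numbers are just for illustration):

def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """Learning rate decay: alpha = alpha0 / (1 + decay_rate * epoch_num)."""
    return alpha0 / (1 + decay_rate * epoch_num)

# With alpha0 = 0.2 and decay_rate = 1.0, epochs 0..3 give 0.2, 0.1, 0.0667, 0.05.
for epoch in range(4):
    print(epoch, decayed_learning_rate(0.2, 1.0, epoch))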
Putting it all together:
import matplotlib.pyplot as plt

# Relies on initialize_parameters, forward_propagation, compute_cost and backward_propagation
# from the assignment's helper files.
def model(X, Y, layers_dims, optimizer, learning_rate=0.0007, mini_batch_size=64, beta=0.9,
          beta1=0.9, beta2=0.999, epsilon=1e-8, num_epochs=10000, print_cost=True):
    """
    3-layer neural network model which can be run in different optimizer modes.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    layers_dims -- python list, containing the size of each layer
    optimizer -- the optimizer to use: "gd", "momentum" or "adam"
    learning_rate -- the learning rate, scalar
    mini_batch_size -- the size of a mini-batch
    beta -- Momentum hyperparameter
    beta1 -- exponential decay hyperparameter for the past gradients estimates
    beta2 -- exponential decay hyperparameter for the past squared gradients estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates
    num_epochs -- number of epochs
    print_cost -- True to print the cost every 1000 epochs

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(layers_dims)    # number of layers in the neural network
    costs = []              # to keep track of the cost
    t = 0                   # counter required for the Adam update
    seed = 10               # for grading purposes, so that the "random" minibatches are reproducible

    # Initialize parameters
    parameters = initialize_parameters(layers_dims)

    # Initialize the optimizer
    if optimizer == "gd":
        pass    # no initialization required for gradient descent
    elif optimizer == "momentum":
        v = initialize_velocity(parameters)
    elif optimizer == "adam":
        v, s = initialize_adam(parameters)

    # Optimization loop
    for i in range(num_epochs):

        # Define the random minibatches. Increment the seed to reshuffle the dataset differently after each epoch
        seed = seed + 1
        minibatches = random_mini_batches(X, Y, mini_batch_size, seed)

        for minibatch in minibatches:

            # Select a minibatch
            (minibatch_X, minibatch_Y) = minibatch

            # Forward propagation
            a3, caches = forward_propagation(minibatch_X, parameters)

            # Compute cost
            cost = compute_cost(a3, minibatch_Y)

            # Backward propagation
            grads = backward_propagation(minibatch_X, minibatch_Y, caches)

            # Update parameters
            if optimizer == "gd":
                parameters = update_parameters_with_gd(parameters, grads, learning_rate)
            elif optimizer == "momentum":
                parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, learning_rate)
            elif optimizer == "adam":
                t = t + 1    # Adam counter
                parameters, v, s = update_parameters_with_adam(parameters, grads, v, s,
                                                               t, learning_rate, beta1, beta2, epsilon)

        # Print the cost every 1000 epochs
        if print_cost and i % 1000 == 0:
            print("Cost after epoch %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)

    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('epochs (per 100)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    return parameters
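A hypothetical call of model (a sketch, not from the notes: the toy dataset, layer sizes and reduced num_epochs stand in for the assignment's "moons" dataset, and the helpers such as initialize_parameters and forward_propagation are assumed to be available):

import numpy as np

# Toy data with the same shapes as the assignment's dataset: 2 features, 300 examples.
np.random.seed(3)
train_X = np.random.randn(2, 300)
train_Y = (train_X[0:1, :] * train_X[1:2, :] > 0).astype(int)

layers_dims = [train_X.shape[0], 5, 2, 1]
parameters = model(train_X, train_Y, layers_dims, optimizer="adam", num_epochs=1000)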