Notes: CS231n Assignment 2 (Part 1)
The second assignment is quite hard, but after working (well, partly copying) my way through it, I learned a lot.
The first task is to refactor the earlier neural-network code so that we can build fully connected networks of arbitrary size. The code is organized in a modular way; the idea looks like this:
# Forward pass
def layer_forward(x, w):
    """ Receive inputs x and weights w """
    # Do the forward computation
    z = # an intermediate value we need to keep for the backward pass
    # Do some more computations ...
    out = # the output

    cache = (x, w, z, out)  # Values we need to compute gradients
    return out, cache


# Backward pass
def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivative with respect to inputs.
    """
    # Unpack cache values
    x, w, z, out = cache

    # Use values in cache to compute derivatives
    dx = # Derivative of loss with respect to x
    dw = # Derivative of loss with respect to w

    return dx, dw
Following this pattern, we are asked to implement the code below:
import numpy as np


def affine_forward(x, w, b):
    """
    x has shape (N, d_1, ..., d_k): the first dimension is the minibatch size
    and the rest is the shape of each image, so on the way in we flatten
    everything after the first dimension into a single vector.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    N = x.shape[0]
    x_new = x.reshape(N, -1)   # flatten to a 2D array
    out = np.dot(x_new, w) + b
    cache = (x, w, b)          # no need to store out
    return out, cache


def affine_backward(dout, cache):
    x, w, b = cache
    dx, dw, db = None, None, None
    dx = np.dot(dout, w.T)
    dx = np.reshape(dx, x.shape)
    x_new = x.reshape(x.shape[0], -1)
    dw = np.dot(x_new.T, dout)
    db = np.sum(dout, axis=0, keepdims=True)
    return dx, dw, db


def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = None
    out = np.maximum(0, x)
    cache = x
    return out, cache


def relu_backward(dout, cache):
    dx, x = None, cache
    #############################################################################
    # TODO: Implement the ReLU backward pass.                                   #
    #############################################################################
    dx = dout
    dx[x <= 0] = 0
    #############################################################################
    #                             END OF YOUR CODE                              #
    #############################################################################
    return dx
The one point worth discussing above is why the formula for db is db = np.sum(dout, axis=0, keepdims=True). At first glance it looks like an averaging step is missing, but it is not: the chain rule says db is the sum over the minibatch of the upstream gradients (every example adds the same b to its scores), and the 1/N averaging is already folded into dout by the softmax loss. The keepdims=True only keeps db with shape (1, M), matching how the biases are initialized below. The gradient-check code does not need any special handling for this.
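If you want to convince yourself that the plain sum is right, a quick numerical check settles it. This is only a throwaway sketch, not part of the assignment code: it compares the analytic db with a centered finite-difference estimate of the gradient of the scalar surrogate sum(out * dout) with respect to b.

import numpy as np

np.random.seed(0)
N, D, M = 4, 5, 3
x = np.random.randn(N, D)
w = np.random.randn(D, M)
b = np.random.randn(M)
dout = np.random.randn(N, M)            # stand-in for the upstream gradient

def out_of_b(b_):
    return np.dot(x, w) + b_            # forward pass, viewed as a function of b only

# analytic gradient: the same sum over the batch axis as in affine_backward
db = np.sum(dout, axis=0)

# numerical gradient of g(b) = sum(out(b) * dout)
db_num = np.zeros_like(b)
h = 1e-5
for i in range(b.size):
    b_plus, b_minus = b.copy(), b.copy()
    b_plus[i] += h
    b_minus[i] -= h
    db_num[i] = (np.sum(out_of_b(b_plus) * dout) - np.sum(out_of_b(b_minus) * dout)) / (2 * h)

print(np.max(np.abs(db - db_num)))      # should be on the order of 1e-10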
With these two basic layers in place, we can build a "sandwich" layer. Since the fc-relu combination is very common, it is provided directly:
def affine_relu_forward(x, w, b):
    """
    Convenience layer that performs an affine transform followed by a ReLU

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache


def affine_relu_backward(dout, cache):
    """
    Backward pass for the affine-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db
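As a quick illustration of how these pieces chain together, here is a hand-rolled two-layer forward/backward pass. The sizes are made up and the random dscores stands in for the softmax gradient; this is only a sketch, not assignment code. It is exactly the pattern that the FullyConnectedNet class below automates with a loop.

import numpy as np

np.random.seed(0)
N, D, H, C = 8, 3 * 32 * 32, 100, 10
X = np.random.randn(N, D)
W1, b1 = 1e-2 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 1e-2 * np.random.randn(H, C), np.zeros(C)

# forward: each layer hands back a cache that its backward pass will need
h1, cache1 = affine_relu_forward(X, W1, b1)
scores, cache2 = affine_forward(h1, W2, b2)

# backward: walk the caches in reverse order
dscores = np.random.randn(*scores.shape)    # stand-in for the softmax gradient
dh1, dW2, db2 = affine_backward(dscores, cache2)
dX, dW1, db1 = affine_relu_backward(dh1, cache1)

print(scores.shape)    # (8, 10)
print(dX.shape)        # (8, 3072)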
Next there is another network built on top of these layers that I won't go into; instead let's talk directly about the most powerful class so far, FullyConnectedNet. Code and comments first:
import numpy as np

from cs231n.layers import *
from cs231n.layer_utils import *


class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0
          then the network should not use dropout at all.
        - use_batchnorm: Whether or not the network should use batch normalization.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
          This will make the dropout layers deterministic so we can gradient
          check the model.
        """
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution with standard deviation equal to  #
        # weight_scale and biases should be initialized to zero.                   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to one and shift      #
        # parameters should be initialized to zero.                                #
        ############################################################################
        # layers_dims holds the size of every layer; hidden_dims is already a
        # list, so the input and output sizes are wrapped in lists to concatenate.
        layers_dims = [input_dim] + hidden_dims + [num_classes]
        for i in xrange(self.num_layers):
            self.params['W' + str(i + 1)] = weight_scale * np.random.randn(layers_dims[i], layers_dims[i + 1])
            self.params['b' + str(i + 1)] = np.zeros((1, layers_dims[i + 1]))
            if self.use_batchnorm and i < len(hidden_dims):  # the last layer needs no batchnorm
                self.params['gamma' + str(i + 1)] = np.ones((1, layers_dims[i + 1]))
                self.params['beta' + str(i + 1)] = np.zeros((1, layers_dims[i + 1]))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the
        # mode (train / test). You can pass the same dropout_param to each
        # dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward
        # pass of the first batch normalization layer, self.bn_params[1] to the
        # forward pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.iteritems():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.dropout_param is not None:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        h, cache1, cache2, cache3, cache4, bn, out = {}, {}, {}, {}, {}, {}, {}
        out[0] = X   # out[i] holds the output of layer i; by this convention X is out[0]

        # Forward pass: compute loss
        for i in xrange(self.num_layers - 1):
            # fetch this layer's parameters
            w, b = self.params['W' + str(i + 1)], self.params['b' + str(i + 1)]
            if self.use_batchnorm:
                gamma, beta = self.params['gamma' + str(i + 1)], self.params['beta' + str(i + 1)]
                h[i], cache1[i] = affine_forward(out[i], w, b)
                bn[i], cache2[i] = batchnorm_forward(h[i], gamma, beta, self.bn_params[i])
                out[i + 1], cache3[i] = relu_forward(bn[i])
                if self.use_dropout:
                    out[i + 1], cache4[i] = dropout_forward(out[i + 1], self.dropout_param)
            else:
                out[i + 1], cache3[i] = affine_relu_forward(out[i], w, b)
                if self.use_dropout:
                    out[i + 1], cache4[i] = dropout_forward(out[i + 1], self.dropout_param)

        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        scores, cache = affine_forward(out[self.num_layers - 1], W, b)   # the final affine layer
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch normalization, you don't need to regularize the scale   #
        # and shift parameters.                                                    #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        data_loss, dscores = softmax_loss(scores, y)
        reg_loss = 0
        for i in xrange(self.num_layers):
            reg_loss += 0.5 * self.reg * np.sum(self.params['W' + str(i + 1)] * self.params['W' + str(i + 1)])
        loss = data_loss + reg_loss

        # Backward pass: compute gradients
        dout, dbn, dh, ddrop = {}, {}, {}, {}
        t = self.num_layers - 1
        # cache here is the one returned by the final affine_forward above
        dout[t], grads['W' + str(t + 1)], grads['b' + str(t + 1)] = affine_backward(dscores, cache)
        for i in xrange(t):
            if self.use_batchnorm:
                if self.use_dropout:
                    dout[t - i] = dropout_backward(dout[t - i], cache4[t - 1 - i])
                dbn[t - 1 - i] = relu_backward(dout[t - i], cache3[t - 1 - i])
                dh[t - 1 - i], grads['gamma' + str(t - i)], grads['beta' + str(t - i)] = \
                    batchnorm_backward(dbn[t - 1 - i], cache2[t - 1 - i])
                dout[t - 1 - i], grads['W' + str(t - i)], grads['b' + str(t - i)] = \
                    affine_backward(dh[t - 1 - i], cache1[t - 1 - i])
            else:
                if self.use_dropout:
                    dout[t - i] = dropout_backward(dout[t - i], cache4[t - 1 - i])
                dout[t - 1 - i], grads['W' + str(t - i)], grads['b' + str(t - i)] = \
                    affine_relu_backward(dout[t - i], cache3[t - 1 - i])

        # Add the regularization gradient contribution
        for i in xrange(self.num_layers):
            grads['W' + str(i + 1)] += self.reg * self.params['W' + str(i + 1)]
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
Because the code above is high-level code, it does not have to worry about how backpropagation is implemented (we already did that), so it is fairly easy to follow. But we are not done yet: we still need a Solver to actually optimize the network.
import numpy as np

from cs231n import optim


class Solver(object):
    """
    A Solver encapsulates all the logic necessary for training classification
    models. The Solver performs stochastic gradient descent using different
    update rules defined in optim.py.

    The solver accepts both training and validation data and labels so it can
    periodically check classification accuracy on both training and validation
    data to watch out for overfitting.

    To train a model, you will first construct a Solver instance, passing the
    model, dataset, and various options (learning rate, batch size, etc) to the
    constructor. You will then call the train() method to run the optimization
    procedure and train the model.

    After the train() method returns, model.params will contain the parameters
    that performed best on the validation set over the course of training.
    In addition, the instance variable solver.loss_history will contain a list
    of all losses encountered during training and the instance variables
    solver.train_acc_history and solver.val_acc_history will be lists containing
    the accuracies of the model on the training and validation set at each epoch.

    Example usage might look something like this:

    data = {
      'X_train': # training data
      'y_train': # training labels
      'X_val': # validation data
      'y_val': # validation labels
    }
    model = MyAwesomeModel(hidden_size=100, reg=10)
    solver = Solver(model, data,
                    update_rule='sgd',
                    optim_config={
                      'learning_rate': 1e-3,
                    },
                    lr_decay=0.95,
                    num_epochs=10, batch_size=100,
                    print_every=100)
    solver.train()


    A Solver works on a model object that must conform to the following API:

    - model.params must be a dictionary mapping string parameter names to numpy
      arrays containing parameter values.

    - model.loss(X, y) must be a function that computes training-time loss and
      gradients, and test-time classification scores, with the following inputs
      and outputs:

      Inputs:
      - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
      - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
        label for X[i].

      Returns:
      If y is None, run a test-time forward pass and return:
      - scores: Array of shape (N, C) giving classification scores for X where
        scores[i, c] gives the score of class c for X[i].

      If y is not None, run a training time forward and backward pass and return
      a tuple of:
      - loss: Scalar giving the loss
      - grads: Dictionary with the same keys as self.params mapping parameter
        names to gradients of the loss with respect to those parameters.
    """

    def __init__(self, model, data, **kwargs):
        """
        Construct a new Solver instance.

        Required arguments:
        - model: A model object conforming to the API described above
        - data: A dictionary of training and validation data with the following:
          'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
          'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
          'y_train': Array of shape (N_train,) giving labels for training images
          'y_val': Array of shape (N_val,) giving labels for validation images

        Optional arguments:
        - update_rule: A string giving the name of an update rule in optim.py.
          Default is 'sgd'.
        - optim_config: A dictionary containing hyperparameters that will be
          passed to the chosen update rule. Each update rule requires different
          hyperparameters (see optim.py) but all update rules require a
          'learning_rate' parameter so that should always be present.
        - lr_decay: A scalar for learning rate decay; after each epoch the
          learning rate is multiplied by this value.
        - batch_size: Size of minibatches used to compute loss and gradient
          during training.
        - num_epochs: The number of epochs to run for during training.
        - print_every: Integer; training losses will be printed every
          print_every iterations.
        - verbose: Boolean; if set to false then no output will be printed
          during training.
        """
        self.model = model
        self.X_train = data['X_train']
        self.y_train = data['y_train']
        self.X_val = data['X_val']
        self.y_val = data['y_val']

        # Unpack keyword arguments
        self.update_rule = kwargs.pop('update_rule', 'sgd')
        self.optim_config = kwargs.pop('optim_config', {})
        self.lr_decay = kwargs.pop('lr_decay', 1.0)
        self.batch_size = kwargs.pop('batch_size', 100)
        self.num_epochs = kwargs.pop('num_epochs', 10)

        self.print_every = kwargs.pop('print_every', 10)
        self.verbose = kwargs.pop('verbose', True)

        # Throw an error if there are extra keyword arguments
        if len(kwargs) > 0:
            extra = ', '.join('"%s"' % k for k in kwargs.keys())
            raise ValueError('Unrecognized arguments %s' % extra)

        # Make sure the update rule exists, then replace the string
        # name with the actual function
        if not hasattr(optim, self.update_rule):
            raise ValueError('Invalid update_rule "%s"' % self.update_rule)
        self.update_rule = getattr(optim, self.update_rule)

        self._reset()

    def _reset(self):
        """
        Set up some book-keeping variables for optimization. Don't call this
        manually.
        """
        # Set up some variables for book-keeping
        self.epoch = 0
        self.best_val_acc = 0
        self.best_params = {}
        self.loss_history = []
        self.train_acc_history = []
        self.val_acc_history = []

        # Make a deep copy of the optim_config for each parameter
        self.optim_configs = {}
        for p in self.model.params:
            d = {k: v for k, v in self.optim_config.iteritems()}
            self.optim_configs[p] = d

    def _step(self):
        """
        Make a single gradient update. This is called by train() and should not
        be called manually.
        """
        # Make a minibatch of training data
        num_train = self.X_train.shape[0]
        batch_mask = np.random.choice(num_train, self.batch_size)
        X_batch = self.X_train[batch_mask]
        y_batch = self.y_train[batch_mask]

        # Compute loss and gradient
        loss, grads = self.model.loss(X_batch, y_batch)
        self.loss_history.append(loss)

        # Perform a parameter update
        for p, w in self.model.params.iteritems():
            dw = grads[p]
            config = self.optim_configs[p]
            next_w, next_config = self.update_rule(w, dw, config)  # there are several possible update rules
            self.model.params[p] = next_w
            self.optim_configs[p] = next_config

    def check_accuracy(self, X, y, num_samples=None, batch_size=100):
        """
        Check accuracy of the model on the provided data.

        Inputs:
        - X: Array of data, of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,)
        - num_samples: If not None, subsample the data and only test the model
          on num_samples datapoints.
        - batch_size: Split X and y into batches of this size to avoid using
          too much memory.

        Returns:
        - acc: Scalar giving the fraction of instances that were correctly
          classified by the model.
        """

        # Maybe subsample the data
        N = X.shape[0]
        if num_samples is not None and N > num_samples:
            mask = np.random.choice(N, num_samples)
            N = num_samples
            X = X[mask]
            y = y[mask]

        # Compute predictions in batches
        num_batches = N / batch_size
        if N % batch_size != 0:
            num_batches += 1
        y_pred = []
        for i in xrange(num_batches):
            start = i * batch_size
            end = (i + 1) * batch_size
            scores = self.model.loss(X[start:end])
            y_pred.append(np.argmax(scores, axis=1))
        y_pred = np.hstack(y_pred)
        acc = np.mean(y_pred == y)

        return acc

    def train(self):
        """
        Run optimization to train the model.
        """
        num_train = self.X_train.shape[0]
        iterations_per_epoch = max(num_train / self.batch_size, 1)
        num_iterations = self.num_epochs * iterations_per_epoch

        for t in xrange(num_iterations):
            self._step()

            # Maybe print training loss
            if self.verbose and t % self.print_every == 0:
                print '(Iteration %d / %d) loss: %f' % (
                    t + 1, num_iterations, self.loss_history[-1])

            # At the end of every epoch, increment the epoch counter and decay
            # the learning rate.
            epoch_end = (t + 1) % iterations_per_epoch == 0
            if epoch_end:
                self.epoch += 1
                for k in self.optim_configs:
                    self.optim_configs[k]['learning_rate'] *= self.lr_decay

            # Check train and val accuracy on the first iteration, the last
            # iteration, and at the end of each epoch.
            first_it = (t == 0)
            last_it = (t == num_iterations - 1)
            if first_it or last_it or epoch_end:
                train_acc = self.check_accuracy(self.X_train, self.y_train,
                                                num_samples=1000)
                val_acc = self.check_accuracy(self.X_val, self.y_val)
                self.train_acc_history.append(train_acc)
                self.val_acc_history.append(val_acc)

                if self.verbose:
                    print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                        self.epoch, self.num_epochs, train_acc, val_acc)

                # Keep track of the best model
                if val_acc > self.best_val_acc:
                    self.best_val_acc = val_acc
                    self.best_params = {}
                    for k, v in self.model.params.iteritems():
                        self.best_params[k] = v.copy()

        # At the end of training swap the best params into the model
        self.model.params = self.best_params
At this point we have essentially built a small framework for fully connected deep networks. Let's review what was done:
1. Implement the forward and backward passes for the fully connected (affine) layer and the ReLU layer.
2. Write the sandwich functions, which simply compose the layers above.
3. Write the FullyConnectedNet class, which takes the network's hyperparameters and returns a corresponding model.
4. Write the Solver class, which takes the model and the image data and runs the final optimization (an end-to-end sketch follows this list).
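Putting the four steps together, here is a minimal end-to-end sketch. It assumes the standard assignment layout (FullyConnectedNet in cs231n/classifiers/fc_net.py, Solver in cs231n/solver.py), and the data below is random noise standing in for CIFAR-10, purely to show how the pieces plug into each other; the real notebook loads the actual dataset.

import numpy as np
from cs231n.classifiers.fc_net import FullyConnectedNet
from cs231n.solver import Solver

# Random data in the same format as the CIFAR-10 dictionary used in the notebook.
data = {
    'X_train': np.random.randn(500, 3, 32, 32),
    'y_train': np.random.randint(10, size=500),
    'X_val': np.random.randn(100, 3, 32, 32),
    'y_val': np.random.randint(10, size=100),
}

model = FullyConnectedNet([100, 100], weight_scale=5e-2, reg=1e-3)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-3},
                lr_decay=0.95,
                num_epochs=5, batch_size=100,
                print_every=100)
solver.train()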
A few points worth noting:
1. The forward pass has to save some intermediate values, so each layer simply returns both out and cache.
2. The multi-layer version needs care with indexing: for layer i, the input is out[i], the output is out[i+1], and the cached values are cache[i].
3. The plain SGD update rule is still pretty naive; other update rules are worth trying later (see the sketch after this list).
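On point 3: Solver._step only assumes that the update rule has the signature (w, dw, config) -> (next_w, config) and lives in optim.py, so anything with that shape can be dropped in. Below is my own sketch of SGD with momentum in that style; the assignment asks you to implement something similar in optim.py, but this is not the reference solution.

import numpy as np

def sgd_momentum(w, dw, config=None):
    """
    SGD with momentum, written against the (w, dw, config) -> (next_w, config)
    interface that Solver._step expects. The velocity is kept inside config so
    the Solver's per-parameter config dict carries it between updates.
    """
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    # The velocity is a decaying running sum of past gradients; the parameter
    # then moves along the velocity instead of the raw gradient.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v

    config['velocity'] = v
    return next_w, config

Once something like this is defined in cs231n/optim.py, it is selected simply with Solver(model, data, update_rule='sgd_momentum', ...), since the Solver looks the rule up by name with getattr(optim, ...).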
So the code above is written. Now what?
There are a few very useful tricks worth remembering here.
When you have built a neural network and are about to run it on your dataset, don't go straight for the biggest, original dataset. The best first step is to overfit a small subset, which proves that your network is actually capable of learning; at that stage you can tune hyperparameters boldly. Personally I'd suggest a smallish learning rate, more iterations, and choosing the weight scale case by case.
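A minimal version of that sanity check, assuming data is the usual CIFAR-10 dictionary from the notebook (or the random stand-in shown earlier); the numbers here are only a reasonable starting point, not reference settings.

# Sanity check: overfit a tiny subset first. If a small net cannot reach
# roughly 100% training accuracy on 50 images, something is wrong with the
# code or the hyperparameters.
num_train = 50
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

model = FullyConnectedNet([100, 100], weight_scale=1e-2, reg=0.0)
solver = Solver(model, small_data,
                update_rule='sgd',
                optim_config={'learning_rate': 1e-2},
                num_epochs=30, batch_size=25,
                print_every=10)
solver.train()
# Expect the training accuracy to climb toward 1.0 while the validation
# accuracy stays low; that gap is exactly the overfitting we want to see here.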
To sum up: assignment 2 contains a lot of material, so that's all for this post. To be continued.