cs231n assignment 2
20210913 - 20211005.
fully-connected nets
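Wrapping each kind of layer in its own functions makes modular programming possible.
Each layer gets a forward function: its input is the input of the computational graph node, and its output is the node's output plus whatever needs to be cached for the backward pass.
    x_rsp = x.reshape(x.shape[0], -1)  # (N, d1, d2, ...) -> (N, D), one sample per row
    A = B.dot(C)                       # matrix multiplication
    dx = dx.reshape(x.shape)           # reshape dx back to the shape of x
    out = x * (x >= 0)                 # relu: keep the values >= 0, a compact numpy idiom
    dx = (x > 0) * dout                # backprop through relu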
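To make the forward/cache idea concrete, here is a minimal sketch of a fully-connected layer written as a forward/backward pair. The names `affine_forward` and `affine_backward` follow the usual cs231n naming, but the bodies are an illustrative assumption, not the assignment's reference solution.

```python
import numpy as np

def affine_forward(x, w, b):
    """Forward pass of a fully-connected layer; returns (out, cache)."""
    x_rsp = x.reshape(x.shape[0], -1)    # (N, d1, d2, ...) -> (N, D)
    out = x_rsp.dot(w) + b               # (N, D) @ (D, M) + (M,) -> (N, M)
    cache = (x, w, b)                    # everything the backward pass needs
    return out, cache

def affine_backward(dout, cache):
    """Backward pass: upstream gradient plus cache -> (dx, dw, db)."""
    x, w, b = cache
    x_rsp = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)  # back to the original input shape
    dw = x_rsp.T.dot(dout)
    db = dout.sum(axis=0)
    return dx, dw, db

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 3))       # 4 samples, each of shape 2x3
w = rng.standard_normal((6, 5))
b = rng.standard_normal(5)
out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(np.ones_like(out), cache)
```

Because every layer exposes the same two functions, a network is just a chain of forward calls followed by the backward calls in reverse order.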
On the dimensions of w in the fully-connected layers:
    layer_input_dim = input_dim
    for i, hd in enumerate(hidden_dims):
        # layer i+1 maps layer_input_dim features to hd features
        self.params['W%d'%(i+1)] = weight_scale * np.random.randn(layer_input_dim, hd)
        self.params['b%d'%(i+1)] = np.zeros(hd)
        if self.use_batchnorm:
            self.params['gamma%d'%(i+1)] = np.ones(hd)
            self.params['beta%d'%(i+1)] = np.zeros(hd)
        layer_input_dim = hd
    # the last layer maps the final hidden size to the class scores
    self.params['W%d'%(self.num_layers)] = weight_scale * np.random.randn(layer_input_dim, num_classes)
    self.params['b%d'%(self.num_layers)] = np.zeros(num_classes)
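Stochastic gradient descent with momentum:
    # decay the old velocity (momentum is typically 0.9), then add a step along the negative gradient
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v            # update W with the velocity
    config['velocity'] = v    # store the updated velocity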
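The update above can be wrapped into a runnable function. This is a sketch: the default values (`learning_rate=1e-2`, `momentum=0.9`) and the zero-initialized velocity are assumptions chosen for illustration.

```python
import numpy as np

def sgd_momentum(w, dw, config=None):
    """One SGD-with-momentum step; returns (next_w, config)."""
    config = {} if config is None else config
    config.setdefault('learning_rate', 1e-2)   # assumed default
    config.setdefault('momentum', 0.9)         # assumed default
    v = config.get('velocity', np.zeros_like(w))

    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config

w = np.array([1.0, -2.0])
dw = np.array([0.5, 0.5])                      # pretend the gradient stays constant
w1, cfg = sgd_momentum(w, dw)                  # v = -0.005
w2, cfg = sgd_momentum(w1, dw, cfg)            # v = 0.9 * (-0.005) - 0.005 = -0.0095
```

With a constant gradient the second step is larger than the first, which is the accumulation effect that momentum is meant to provide.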
    # RMSProp update
    # cache: weighted average of the old cache and dx squared, with decay_rate as the weight
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx**2)
    # next_x: take a step of size learning_rate along the negative of dx / (sqrt(cache) + epsilon);
    # the small epsilon prevents division by zero
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
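    # Adam update
    # t: incremented on every update of W; it drives the bias correction of m and v
    config['t'] += 1
    # m: weighted average of the old m and dx, with beta1 as the weight
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    # v: weighted average of the old v and dx squared, with beta2 as the weight
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx**2)
    # mb: m divided by (1 - beta1**t), which makes it a bit larger. m starts at 0, so early on it
    # underestimates the average gradient. As t grows, beta1**t shrinks and 1 - beta1**t approaches 1,
    # so the correction fades away.
    mb = config['m'] / (1 - config['beta1']**config['t'])
    # vb: v divided by (1 - beta2**t), the same correction as above
    vb = config['v'] / (1 - config['beta2']**config['t'])
    # next_x: take a step of size learning_rate along the negative of mb / (sqrt(vb) + epsilon)
    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])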
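- To do stochastic gradient descent we pick a descent direction and take a step of size learning rate.
- Which direction? The direction of mb divided by sqrt(vb) + epsilon.
- What is mb? It is m divided by (1 - β1^t).
- Dividing by (1 - β1^t) is a bias correction. m starts at 0, so early on it underestimates the average gradient, and since 1 - β1^t < 1 the division scales it back up. As t grows, β1^t shrinks and 1 - β1^t approaches 1, so the correction fades away.
- What is m? It is essentially momentum; its update is a weighted average of the old m and the current dx.
- What is vb? It is v divided by (1 - β2^t).
- Dividing by (1 - β2^t) is the same bias correction applied to v. vb sits in the denominator of the step, so enlarging it keeps the first few steps from being too large. There is no contradiction.
- What is v? It is the RMSProp trick; its update is a weighted average of the old v and the current dx².
- So Adam combines momentum and RMSProp: it moves along the momentum direction, divides by the square root of the averaged dx², and uses the division by (1 - β^t) to correct the zero-initialization bias of both.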
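The effect of the division by (1 - β^t) is easiest to see with a constant gradient. The sketch below assumes the common values β1 = 0.9 and β2 = 0.999 and a made-up gradient `g`. The raw m starts at only 10% of g, while the corrected estimates equal g and g² at every step.

```python
import numpy as np

beta1, beta2 = 0.9, 0.999          # assumed, commonly used values
g = np.array([2.0, -3.0])          # pretend the gradient is constant
m = np.zeros_like(g)
v = np.zeros_like(g)

raw_m, corrected_m, corrected_v = [], [], []
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    raw_m.append(m.copy())                      # biased toward 0 early on
    corrected_m.append(m / (1 - beta1**t))      # bias-corrected first moment
    corrected_v.append(v / (1 - beta2**t))      # bias-corrected second moment
```

For a constant gradient, m after t steps is exactly g * (1 - β1^t), which is why dividing by (1 - β1^t) recovers g.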
Review of the multiclass SVM loss and the softmax loss
multiclass svm loss & derivative
It also seems to be called hinge loss.
    N = x.shape[0]
    correct_class_scores = x[np.arange(N), y]  # scores of the correct classes, a length-N vector
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0               # only the wrong classes count
    loss = np.sum(margins) / N                 # sum the loss over the N samples, then average
    num_pos = np.sum(margins > 0, axis=1)
    dx = np.zeros_like(x)                      # all-zero matrix with the shape of x
    dx[margins > 0] = 1                        # the loss grows when a violating wrong-class score grows
    dx[np.arange(N), y] -= num_pos             # and each violation also pushes the correct-class score down
    dx /= N                                    # average over the N samples
    return loss, dx
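A small worked example may help here. The scores and labels below are made up for illustration; the function body is the code above wrapped in a hypothetical `svm_loss`.

```python
import numpy as np

def svm_loss(x, y):
    N = x.shape[0]
    correct = x[np.arange(N), y]
    margins = np.maximum(0, x - correct[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0
    loss = np.sum(margins) / N
    dx = np.zeros_like(x)
    dx[margins > 0] = 1
    dx[np.arange(N), y] -= np.sum(margins > 0, axis=1)
    dx /= N
    return loss, dx

scores = np.array([[1.0, 3.0, 2.0],    # correct class 0 scores lowest: margins 3 and 2
                   [4.0, 1.0, 2.0]])   # correct class 0 wins by more than 1: no loss
labels = np.array([0, 0])
loss, dx = svm_loss(scores, labels)    # loss = (3 + 2 + 0) / 2 = 2.5
```

Only the first sample contributes gradient: +1/N on each violating class and -2/N on the correct class.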
softmax loss & derivative
Also known as cross-entropy loss.
    shifted_logits = x - np.max(x, axis=1, keepdims=True)
    # equivalent to exp(x) / exp(max(x))
    Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
    # equivalent to summing exp(x) / exp(max(x)), i.e. sum(exp(x)) / exp(max(x))
    log_probs = shifted_logits - np.log(Z)
    # equivalent to log(exp(x) / sum(exp(x))), the log of the probabilities; this saves many exp calls
    # on the seemingly pointless subtraction of max(x) (it keeps exp from overflowing):
    # https://zhuanlan.zhihu.com/p/92714192
    probs = np.exp(log_probs)                        # the probabilities
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N   # -log(probability of the correct class), averaged over N samples
    dx = probs.copy()                                # start from the probabilities
    dx[np.arange(N), y] -= 1                         # the gradient of -log(softmax) w.r.t. the scores is probs - 1 at the correct class
    dx /= N                                          # the loss is a mean over N samples, so each sample contributes 1/N
    return loss, dx
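One way to convince yourself that the "probabilities minus one at the correct class" gradient is right is to compare it with a numerical gradient. The helper `numeric_grad` and the random inputs below are assumptions made for this sketch.

```python
import numpy as np

def softmax_loss(x, y):
    shifted = x - np.max(x, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N
    dx = np.exp(log_probs)
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

def numeric_grad(f, x, h=1e-5):
    """Centered finite differences of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        fp = f(x)
        x[idx] = old - h
        fm = f(x)
        x[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
y = np.array([0, 3, 1, 2, 1])
loss, dx = softmax_loss(x, y)
dx_num = numeric_grad(lambda z: softmax_loss(z, y)[0], x)
loss_shifted, _ = softmax_loss(x + 100.0, y)   # adding a constant to all scores changes nothing
```

The shift by 100 illustrates why subtracting max(x) is safe: softmax is invariant to adding a constant to every score of a sample.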
batch normalization
    # training
    sample_mean = np.mean(x, axis=0)
    sample_var = np.var(x, axis=0)
    x_hat = (x - sample_mean) / (np.sqrt(sample_var + eps))
    out = gamma * x_hat + beta
    cache = (gamma, x, sample_mean, sample_var, eps, x_hat)  # eps is cached too, the backward pass unpacks it
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var
    # test time
    scale = gamma / (np.sqrt(running_var + eps))
    out = x * scale + (beta - running_mean * scale)
    # mathematically the same as normalizing and then applying gamma and beta;
    # folding everything into one scale and one shift just seems to save a little computation
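    # I probably won't understand this the next time I read it either.
    # The gist: when x changes, the mean and the variance change too, and the derivative has to account for that.
    gamma, x, sample_mean, sample_var, eps, x_hat = cache
    N = x.shape[0]
    dbeta = np.sum(dout, axis=0)             # yes, a sum: accumulate the contribution of every sample
    dgamma = np.sum(dout * x_hat, axis=0)
    dx_hat = dout * gamma
    # the mean and the variance are shared by the whole batch, so their gradients are sums over the samples;
    # without these sums the result does not match the numerical gradient
    dvar = np.sum(dx_hat * (x - sample_mean), axis=0) * -0.5 * np.power(sample_var + eps, -1.5)
    dmean = -np.sum(dx_hat, axis=0) / np.sqrt(sample_var + eps)  # the path through the variance vanishes because sum(x - mean) = 0
    dmean_wrt_dx = 1.0 / N                   # every sample contributes 1/N; 1.0 avoids integer division in Python 2
    dvar_wrt_dx = 2.0 / N * (x - sample_mean)  # from the definition of the variance
    dx = dx_hat / np.sqrt(sample_var + eps) + dmean * dmean_wrt_dx + dvar * dvar_wrt_dx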
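When a batchnorm backward pass is in doubt, a numerical gradient check settles it. The sketch below uses the compact closed form for dx (algebraically equivalent to summing the three paths through x_hat, the mean and the variance). The function names `bn_forward`, `bn_backward` and `numeric_grad` are hypothetical and only cover the training-mode forward pass.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mean) * inv_std
    return gamma * x_hat + beta, (gamma, x_hat, inv_std)

def bn_backward(dout, cache):
    gamma, x_hat, inv_std = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    dx_hat = dout * gamma
    # compact form: direct path minus the mean path minus the variance path
    dx = inv_std / N * (N * dx_hat
                        - dx_hat.sum(axis=0)
                        - x_hat * (dx_hat * x_hat).sum(axis=0))
    return dx, dgamma, dbeta

def numeric_grad(f, x, h=1e-5):
    """Centered finite differences of a scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        fp = f(x)
        x[idx] = old - h
        fm = f(x)
        x[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 3))
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
dout = rng.standard_normal((6, 3))

out, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(dout, cache)
dx_num = numeric_grad(lambda z: np.sum(bn_forward(z, gamma, beta)[0] * dout), x)
max_err = np.max(np.abs(dx - dx_num))
```

A useful sanity property: dx sums to zero over the batch for every feature, because shifting all samples by the same constant leaves the normalized output unchanged.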
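inverted dropout: during training, set each neuron to 0 at random with probability p, so a fraction (1-p) of the original values survives, then divide all values by (1-p) (much like taking an average). Scaling up the surviving (1-p) fraction keeps the expected activation unchanged, as if nothing had happened. At test time nothing needs to be done.
Network structure: affine - [batch norm] - relu - [dropout].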
    # forward
    mask = (np.random.rand(*x.shape) >= p) / (1-p)
    out = x * mask
    # backward
    dx = dout * mask
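The claim that inverted dropout leaves the expected activation unchanged can be checked empirically. This sketch assumes p = 0.3 and an all-ones input; `dropout_forward` and `dropout_backward` are hypothetical wrappers around the two lines above, with an explicit random generator so the run is reproducible.

```python
import numpy as np

def dropout_forward(x, p, rng):
    # drop with probability p, scale the survivors by 1 / (1 - p)
    mask = (rng.random(x.shape) >= p) / (1 - p)
    return x * mask, mask

def dropout_backward(dout, mask):
    # the same mask routes (and scales) the gradient
    return dout * mask

rng = np.random.default_rng(0)
p = 0.3
x = np.ones(100_000)
out, mask = dropout_forward(x, p, rng)
dx = dropout_backward(np.ones_like(x), mask)

mean_out = out.mean()                  # close to 1.0, the mean of the input
dropped_fraction = (mask == 0).mean()  # close to p
```

Because the scaling already happens during training, the test-time forward pass is just the identity.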
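Adding dropout to the [fully connected - batch norm - relu - dropout] structure: in the forward pass, apply dropout to the output as the last step; in the backward pass, first run the gradient coming from the layer above through the dropout backward.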
convolutional networks
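The input has shape (N, C, H, W), where N is the number of samples, C is the number of channels (e.g. 3 for RGB), and H and W are the height and width.
The filters have shape (F, C, HH, WW), where F is the number of filters, HH is the filter height and WW is the filter width.
The output has shape (N, F, H_out, W_out): every sample is convolved with each of the F filters, so the first dimension is N and the second is F. H_out and W_out are the height and width after the convolution.
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride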
out[:, f, i, j] = np.sum(x_masked * w[f,:,:,:], axis=(1,2,3))
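The shape formula and the inner `np.sum` line can be tried on a tiny all-ones input, where every output value is simply the number of real (non-padding) input entries under the filter. This is a naive sketch; `conv_out_size` and the test sizes are assumptions for illustration.

```python
import numpy as np

def conv_out_size(size, k, pad, stride):
    return 1 + (size + 2 * pad - k) // stride

def conv_forward_naive(x, w, b, stride, pad):
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    H_out = conv_out_size(H, HH, pad, stride)
    W_out = conv_out_size(W, WW, pad, stride)
    x_pad = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    out = np.zeros((N, F, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            # window of shape (N, C, HH, WW) shared by all filters
            x_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            for f in range(F):
                out[:, f, i, j] = np.sum(x_masked * w[f], axis=(1, 2, 3)) + b[f]
    return out

x = np.ones((2, 3, 5, 5))
w = np.ones((4, 3, 3, 3))
b = np.zeros(4)
out = conv_forward_naive(x, w, b, stride=2, pad=1)   # H_out = 1 + (5 + 2 - 3) // 2 = 3
```

At the corner the 3x3 window covers only a 2x2 patch of real input across 3 channels, giving 12; in the center it covers the full 3x3 patch, giving 27.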
max pooling
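The input has shape (N, C, H, W); the pooling parameters are HH, WW and stride.
The output has shape (N, C, H_out, W_out), where H_out and W_out are computed like this (same as for convolution, without padding):
    H_out = 1 + (H - HH) // stride
    W_out = 1 + (W - WW) // stride
To compute the max, use np.max(x_masked, axis=(2,3)).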
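A 4x4 example makes the window arithmetic visible. This is a minimal sketch with a hypothetical `max_pool_forward` and a made-up input holding the numbers 0 to 15.

```python
import numpy as np

def max_pool_forward(x, HH, WW, stride):
    N, C, H, W = x.shape
    H_out = 1 + (H - HH) // stride
    W_out = 1 + (W - WW) // stride
    out = np.zeros((N, C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            out[:, :, i, j] = np.max(x_masked, axis=(2, 3))  # max over the window only
    return out

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)
out = max_pool_forward(x, HH=2, WW=2, stride=2)
# each 2x2 window keeps its bottom-right (largest) entry: [[5, 7], [13, 15]]
```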
spatial batch normalization
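Let the input be a 4-D array of shape (N, C, H, W). In a CNN we treat each feature map as one feature (one neuron), so for spatial batchnorm the effective mini-batch size is N*H*W, and each feature map has only two learnable parameters: γ and β.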
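The "mini-batch of size N*H*W per channel" view can be checked directly: flattening to (N*H*W, C) and taking statistics over axis 0 gives the same numbers as reducing over axes (0, 2, 3) of the 4-D array. The shapes and random input below are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 4, 3, 5, 5
x = rng.standard_normal((N, C, H, W)) * 2.0 + 1.0

# view 1: move channels last, then treat every pixel of every sample as one row
flat = x.transpose(0, 2, 3, 1).reshape(-1, C)     # (N*H*W, C)
mean_flat = flat.mean(axis=0)
var_flat = flat.var(axis=0)

# view 2: reduce over every axis except the channel axis
mean_axes = x.mean(axis=(0, 2, 3))
var_axes = x.var(axis=(0, 2, 3))

# normalize each channel with its own statistics (gamma = 1, beta = 0 here)
eps = 1e-5
x_hat = (x - mean_axes[None, :, None, None]) / np.sqrt(var_axes[None, :, None, None] + eps)
```

Either view yields exactly C means and C variances, which matches the one γ and one β per feature map.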
    # convolution
    # forward
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    stride, pad = conv_param['stride'], conv_param['pad']
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride
    out = np.zeros((N, F, H_out, W_out))
    x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
    """
    np.pad: pads the edges of an array, i.e. the padding operation.
    The first argument is the array to pad.
    The second argument is the pad width, given as ((before_1, after_1), ... (before_N, after_N)),
    where (before_1, after_1) means that axis 1 gets before_1 values at its start and after_1 values at its end.
    The last arguments select how the padding is filled.
    """
    for i in range(H_out):
        for j in range(W_out):
            x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            for k in range(F):
                out[:, k, i, j] = np.sum(x_pad_masked * w[k, :, :, :], axis=(1,2,3))
    for k in range(F):
        out[:, k, :, :] += b[k]

    # backward
    x, w, b, conv_param = cache
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    stride, pad = conv_param['stride'], conv_param['pad']
    N, F, H_out, W_out = dout.shape
    x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
    dx = np.zeros_like(x)
    dx_pad = np.zeros_like(x_pad)
    dw = np.zeros_like(w)
    db = np.sum(dout, axis=(0,2,3))
    for i in range(H_out):
        for j in range(W_out):
            x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            for k in range(F):  # compute dw
                # for each filter, the sum accumulates the contribution of the N samples
                dw[k,:,:,:] += np.sum(x_pad_masked * (dout[:,k,i,j])[:, None, None, None], axis=0)
            for n in range(N):  # compute dx_pad
                # for each sample, the sum accumulates the gradients coming from the F filters
                dx_pad[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += np.sum((w[:,:,:,:] * (dout[n, :, i, j])[:, None, None, None]), axis=0)
    dx = dx_pad[:, :, pad:pad+H, pad:pad+W]  # pad:-pad would give an empty slice when pad == 0
max pooling:
    # forward
    N, C, H, W = x.shape
    HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
    H_out = (H - HH) // stride + 1
    W_out = (W - WW) // stride + 1
    out = np.zeros((N, C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            out[:,:,i,j] = np.max(x_masked, axis=(2,3))

    # backward
    x, pool_param = cache
    N, C, H, W = x.shape
    HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
    N, C, H_out, W_out = dout.shape
    dx = np.zeros_like(x)
    for i in range(H_out):
        for j in range(W_out):
            x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            max_x_masked = np.max(x_masked, axis=(2,3))
            # if several values tie for the max, all of them receive the gradient
            temp_binary_mask = (x_masked == (max_x_masked)[:,:,None,None])
            dx[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += temp_binary_mask * (dout[:,:,i,j])[:,:,None,None]
spatial batch normalization:
    # forward
    N, C, H, W = x.shape
    temp_output, cache = batchnorm_forward(x.transpose(0,3,2,1).reshape((N*H*W,C)), gamma, beta, bn_param)
    out = temp_output.reshape(N,W,H,C).transpose(0,3,2,1)
    # backward
PyTorch quick start
    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.autograd import Variable
    from torch.utils.data import DataLoader
    from torch.utils.data import sampler
    import torchvision.datasets as dset
    import torchvision.transforms as T
    import numpy as np
    import timeit

    dtype = torch.FloatTensor             # the CPU datatype
    torch.cuda.is_available()             # checks for a GPU; returns True if there is one
    gpu_dtype = torch.cuda.FloatTensor    # the GPU datatype
Next we define a Flatten module. It unrolls an input of shape N*C*H*W into shape N*(C*H*W), i.e. the equivalent of np.reshape(x, (x.shape[0], -1)).
    class Flatten(nn.Module):
        def forward(self, x):
            N, C, H, W = x.size()   # read in N, C, H, W
            return x.view(N, -1)    # "flatten" the C * H * W values into a single vector per image
    '''
    architecture:
    [conv - ReLU - BatchNorm - MaxPool] - [conv - ReLU - BatchNorm - MaxPool]
    - [affine - BatchNorm - ReLU] - [affine - softmax]
    '''
    model_base = nn.Sequential(
        nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(num_features=16),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(num_features=32),
        nn.MaxPool2d(kernel_size=2, stride=2),
        Flatten(),
        nn.Linear(1152, 200),   # 1152=32*6*6 input size
        nn.BatchNorm1d(num_features=200),
        nn.ReLU(inplace=True),
        nn.Linear(200, 10),     # affine layer
    )
    model = model_base.type(dtype)   # define the base first, then cast it to the concrete data type
    loss_fn = nn.CrossEntropyLoss().type(dtype)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
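The 1152 in the first linear layer follows from the layer-by-layer spatial size. The sketch below assumes 32x32 input images (as in CIFAR-10) and unpadded convolutions, which matches the `nn.Conv2d` calls above; `conv_out` and `pool_out` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride):
    return (size - kernel) // stride + 1

size = 32                       # assumed input height/width
size = conv_out(size, 5)        # 5x5 conv, stride 1 -> 28
size = pool_out(size, 2, 2)     # 2x2 max pool       -> 14
size = conv_out(size, 3)        # 3x3 conv, stride 1 -> 12
size = pool_out(size, 2, 2)     # 2x2 max pool       -> 6
flat_features = 32 * size * size   # 32 channels * 6 * 6
```

Changing any kernel size, stride or padding changes this number, so it is worth recomputing whenever the architecture is edited.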
cs231n provides functions for training and for checking accuracy; we copy them directly:
    def train(model, loss_fn, optimizer, num_epochs = 1):
        for epoch in range(num_epochs):
            print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
            model.train()
            for t, (x, y) in enumerate(loader_train):
                x_var = Variable(x.type(dtype))
                y_var = Variable(y.type(dtype).long())

                scores = model(x_var)

                loss = loss_fn(scores, y_var)
                if (t + 1) % print_every == 0:
                    print('t = %d, loss = %.4f' % (t + 1, loss.item()))

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    def check_accuracy(model, loader):
        if loader.dataset.train:
            print('Checking accuracy on validation set')
        else:
            print('Checking accuracy on test set')
        num_correct = 0
        num_samples = 0
        model.eval()  # Put the model in test mode (the opposite of model.train(), essentially)
        for x, y in loader:
            with torch.no_grad():
                x_var = Variable(x.type(dtype))
                scores = model(x_var)
            _, preds = scores.data.cpu().max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

    train(model, loss_fn, optimizer, num_epochs=10)
    check_accuracy(model, loader_val)         # validation
    check_accuracy(best_model, loader_test)   # test
TensorFlow quick start
    import tensorflow.compat.v1 as tf
    tf.compat.v1.disable_eager_execution()
    import numpy as np
    import math
    import timeit
    import matplotlib.pyplot as plt
    %matplotlib inline

    X = tf.placeholder(tf.float32, [None, 32, 32, 3])
    y = tf.placeholder(tf.int64, [None])
    is_training = tf.placeholder(tf.bool)  # batchnorm behaves differently in training and test, so we track the mode

    def my_model(X,y,is_training):
        # Conv-Relu-BN
        conv1act = tf.layers.conv2d(inputs=X, filters=32, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
        bn1act = tf.layers.batch_normalization(inputs=conv1act, training=is_training)
        # Conv-Relu-BN
        conv2act = tf.layers.conv2d(inputs=bn1act, filters=64, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
        bn2act = tf.layers.batch_normalization(inputs=conv2act, training=is_training)
        # Maxpool
        maxpool1act = tf.layers.max_pooling2d(inputs=bn2act, pool_size=2, strides=2)
        # Flatten
        flatten1 = tf.reshape(maxpool1act,[-1,16384])  # 16384 = 16*16*64
        # FC-Relu-BN
        fc1 = tf.layers.dense(inputs=flatten1, units=1024, activation=tf.nn.relu)
        bn3act = tf.layers.batch_normalization(inputs=fc1, training=is_training)
        # Output FC
        y_out = tf.layers.dense(inputs=bn3act, units=10, activation=None)
        return y_out

    # clear old variables
    tf.reset_default_graph()
    y_out = my_model(X,y,is_training)
    mean_loss = tf.losses.softmax_cross_entropy(logits=y_out, onehot_labels=tf.one_hot(y,10))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
    # batch normalization in tensorflow requires this extra dependency:
    # the ops that update the moving mean and variance must run before each training step
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(extra_update_ops):
        train_step = optimizer.minimize(mean_loss)
    def run_model(session, predict, loss_val, Xd, yd,
                  epochs=1, batch_size=64, print_every=100,
                  training=None, plot_losses=False):
        # have tensorflow compute accuracy
        correct_prediction = tf.equal(tf.argmax(predict,1), y)
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        # shuffle indices
        train_indicies = np.arange(Xd.shape[0])
        np.random.shuffle(train_indicies)

        training_now = training is not None

        # setting up variables we want to compute (and optimizing)
        # if we have a training function, add that to things we compute
        variables = [mean_loss,correct_prediction,accuracy]
        if training_now:
            variables[-1] = training

        # counter
        iter_cnt = 0
        for e in range(epochs):
            # keep track of losses and accuracy
            correct = 0
            losses = []
            # make sure we iterate over the dataset once
            for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
                # generate indices for the batch
                start_idx = (i*batch_size)%Xd.shape[0]
                idx = train_indicies[start_idx:start_idx+batch_size]

                # create a feed dictionary for this batch
                feed_dict = {X: Xd[idx,:],
                             y: yd[idx],
                             is_training: training_now }
                # get batch size
                actual_batch_size = yd[idx].shape[0]

                # have tensorflow compute loss and correct predictions
                # and (if given) perform a training step
                loss, corr, _ = session.run(variables,feed_dict=feed_dict)

                # aggregate performance stats
                losses.append(loss*actual_batch_size)
                correct += np.sum(corr)

                # print every now and then
                if training_now and (iter_cnt % print_every) == 0:
                    print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
                          .format(iter_cnt,loss,np.sum(corr)/actual_batch_size))
                iter_cnt += 1
            total_correct = correct/Xd.shape[0]
            total_loss = np.sum(losses)/Xd.shape[0]
            print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
                  .format(total_loss,total_correct,e+1))
            if plot_losses:
                plt.plot(losses)
                plt.grid(True)
                plt.title('Epoch {} Loss'.format(e+1))
                plt.xlabel('minibatch number')
                plt.ylabel('minibatch loss')
                plt.show()
        return total_loss,total_correct
    sess = tf.Session()  # a session encapsulates the state of the compute graph and the control over it
    sess.run(tf.global_variables_initializer())
    print('Training')
    run_model(sess,y_out,mean_loss,X_train,y_train,10,64,100,train_step,True)
    print('Validation')
    run_model(sess,y_out,mean_loss,X_val,y_val,1,64)

    print('Test')
    run_model(sess,y_out,mean_loss,X_test,y_test,1,64)
