DL Basics | cs231n assignment 2
cs231n assignment 2
2021-09-13 to 2021-10-05.
fully-connected nets
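Basic idea
Wrapping each kind of layer in its own pair of functions makes modular programming possible.
Each layer gets a forward function: it takes the inputs of a node in the computational graph and returns the node's output plus whatever has to be cached for the backward pass.
Each layer gets a backward function: it takes the upstream derivative (the derivative of the loss with respect to the node's output) and returns the derivative with respect to each of the node's inputs.
In backward, the chain rule reduces to matrix multiplications; matching the dimensions is usually enough to work out which matrices to multiply and in what order.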
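As a minimal sketch of this forward/backward pairing, here is a fully-connected (affine) layer in NumPy. The names `affine_forward` and `affine_backward` and the input sizes are chosen for illustration; this is not the assignment's reference solution.

```python
import numpy as np

def affine_forward(x, w, b):
    """Forward pass: returns the output and a cache for the backward pass."""
    x_rows = x.reshape(x.shape[0], -1)      # (N, d1, d2, ...) -> (N, D)
    out = x_rows.dot(w) + b                 # (N, D) @ (D, M) + (M,) -> (N, M)
    cache = (x, w, b)
    return out, cache

def affine_backward(dout, cache):
    """Backward pass: upstream gradient dout (N, M) -> gradients of x, w, b."""
    x, w, b = cache
    x_rows = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)     # (N, M) @ (M, D), then back to x's shape
    dw = x_rows.T.dot(dout)                 # (D, N) @ (N, M) -> (D, M)
    db = dout.sum(axis=0)                   # (M,)
    return dx, dw, db

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 3))          # N=4 samples, each of shape (2, 3)
w = rng.standard_normal((6, 5))             # D=6 inputs, M=5 outputs
b = rng.standard_normal(5)

out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(np.ones_like(out), cache)
print(out.shape, dx.shape, dw.shape, db.shape)
```

Every gradient has the same shape as the quantity it belongs to, which is the dimension-matching trick mentioned above.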
Implementation details

    x_rsp = x.reshape(x.shape[0], -1)  # N*d1*d2*... -> N*D, one sample per row
    A = B.dot(C)                       # matrix multiplication
    dx = dx.reshape(x.shape)           # reshape dx back to the shape of x
    out = x * (x >= 0)                 # ReLU: keep the values >= 0, a compact NumPy idiom
    dx = (x > 0) * dout                # ReLU backprop

About the dimensions of W in the fully-connected layers:

    # each W has shape (input dim, output dim); the output dim of one layer is the input dim of the next
    layer_input_dim = input_dim
    for i, hd in enumerate(hidden_dims):
        self.params['W%d' % (i+1)] = weight_scale * np.random.randn(layer_input_dim, hd)
        self.params['b%d' % (i+1)] = np.zeros(hd)
        if self.use_batchnorm:
            self.params['gamma%d' % (i+1)] = np.ones(hd)
            self.params['beta%d' % (i+1)] = np.zeros(hd)
        layer_input_dim = hd
    self.params['W%d' % (self.num_layers)] = weight_scale * np.random.randn(layer_input_dim, num_classes)
    self.params['b%d' % (self.num_layers)] = np.zeros(num_classes)
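Stochastic gradient descent with momentum:

    # decay the old velocity (the momentum coefficient is typically 0.9), then add the step along the negative gradient
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v            # update w with the velocity
    config['velocity'] = v    # store the updated velocity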
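To see the update in action, the sketch below wraps it in a function and minimizes f(w) = sum(w**2), whose gradient is 2w. The function name `sgd_momentum`, the toy objective, the step count, and the hyperparameter values are assumptions made for this example.

```python
import numpy as np

def sgd_momentum(w, dw, config):
    """One SGD-with-momentum step; the velocity lives in config between calls."""
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config

w = np.array([1.0, -2.0])
config = {'learning_rate': 0.1, 'momentum': 0.9}
for _ in range(300):
    w, config = sgd_momentum(w, 2 * w, config)   # gradient of sum(w**2) is 2w

print(w)   # very close to the minimum at [0, 0]
```

With these settings the iterate overshoots and oscillates around the minimum while the oscillation dies out, which is the typical behaviour of momentum on a quadratic bowl.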
RMSProp:

    # cache: weighted average of the old cache and dx squared, with decay_rate as the weight
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx**2)
    # next_x: step of size learning_rate along -dx / (sqrt(cache) + epsilon); the small epsilon guards against division by zero
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
Adam:

    config['t'] += 1  # t: incremented on every update; it drives the bias correction of m and v
    # m: weighted average of the old m and dx, with beta1 as the weight
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    # v: weighted average of the old v and dx squared, with beta2 as the weight
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx**2)
    # mb: m divided by (1 - beta1**t). m starts at zero, so early on it is biased toward zero and the division scales it back up.
    # As t grows, beta1**t shrinks, the divisor approaches 1, and the correction fades out.
    mb = config['m'] / (1 - config['beta1']**config['t'])
    # vb: v divided by (1 - beta2**t), the same correction applied to v
    vb = config['v'] / (1 - config['beta2']**config['t'])
    # next_x: step of size learning_rate along -mb / (sqrt(vb) + epsilon)
    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
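So what is Adam actually doing?
- We are doing stochastic gradient descent, so we pick a descent direction and take a step of size learning rate.
- The direction is mb / (sqrt(vb) + epsilon).
- mb is m / (1 - β1^t).
- Dividing by (1 - β1^t) is a bias correction. m starts at 0, so in the first steps it underestimates the average gradient; since 1 - β1^t < 1, dividing by it scales m up. As t grows, β1^t shrinks, 1 - β1^t approaches 1, and the correction gradually disappears.
- m is the momentum term: its update is a weighted average of the old m and the current dx.
- vb is v / (1 - β2^t).
- Dividing by (1 - β2^t) is the same bias correction applied to v. It does not conflict with vb sitting in the denominator of the step: the point is an unbiased estimate of the average dx², not a larger or smaller step.
- v comes from RMSProp: its update is a weighted average of the old v and the current dx².
- So Adam combines momentum and RMSProp: it moves along the momentum direction, divides by the root of the averaged squared dx, and applies the 1 / (1 - β^t) bias correction to both.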
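The effect of the bias correction is easiest to see on the very first update. In the sketch below (the function name `adam_step`, the gradient values, and the hyperparameters are assumptions for illustration), m and v start at zero, so after one step the raw m is only 0.1 * dx, while the corrected mb equals dx and vb equals dx².

```python
import numpy as np

def adam_step(x, dx, config):
    """One Adam update; config holds the hyperparameters and the running state."""
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dx**2
    mb = config['m'] / (1 - config['beta1'] ** config['t'])   # bias-corrected m
    vb = config['v'] / (1 - config['beta2'] ** config['t'])   # bias-corrected v
    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
    return next_x, mb, vb

x = np.array([1.0, -1.0])
dx = np.array([0.5, -4.0])
config = {'learning_rate': 1e-3, 'beta1': 0.9, 'beta2': 0.999, 'epsilon': 1e-8,
          'm': np.zeros_like(x), 'v': np.zeros_like(x), 't': 0}

next_x, mb, vb = adam_step(x, dx, config)
raw_m = config['m']
print(raw_m, mb, vb, next_x)
```

Because mb / sqrt(vb) is then dx / |dx|, the first step moves every coordinate by about learning_rate regardless of the gradient's magnitude.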
Review: multiclass svm loss and softmax loss
multiclass svm loss & derivative
Also known as hinge loss.

    N = x.shape[0]
    correct_class_scores = x[np.arange(N), y]  # scores of the correct classes, a vector of length N
    margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0               # only the wrong classes count
    loss = np.sum(margins) / N                 # sum the loss over the N samples, then average
    num_pos = np.sum(margins > 0, axis=1)
    dx = np.zeros_like(x)                      # zeros with the shape of x
    dx[margins > 0] = 1                        # the loss grows when a violating wrong-class score grows
    dx[np.arange(N), y] -= num_pos             # each violating wrong class also pushes the correct-class score down
    dx /= N                                    # average over the N samples
    return loss, dx
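A small worked example makes the margin bookkeeping concrete. The code is the same computation wrapped in a function (called `svm_loss` here for illustration), and the scores are made up.

```python
import numpy as np

def svm_loss(x, y):
    """Multiclass SVM (hinge) loss and its gradient with respect to the scores x."""
    N = x.shape[0]
    correct = x[np.arange(N), y]
    margins = np.maximum(0, x - correct[:, np.newaxis] + 1.0)
    margins[np.arange(N), y] = 0
    loss = np.sum(margins) / N
    dx = np.zeros_like(x)
    dx[margins > 0] = 1
    dx[np.arange(N), y] -= np.sum(margins > 0, axis=1)
    dx /= N
    return loss, dx

# Sample 0: correct class 0 scores 3.0; class 2 (2.5) is inside the margin by 0.5.
# Sample 1: correct class 1 scores 2.0; class 2 (3.0) violates the margin by 2.0.
x = np.array([[3.0, 1.0, 2.5],
              [1.0, 2.0, 3.0]])
y = np.array([0, 1])
loss, dx = svm_loss(x, y)
print(loss)   # (0.5 + 2.0) / 2
print(dx)
```

Each row of dx sums to zero: every +1 on a violating class is matched by a -1 on the correct class.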
softmax loss & derivative
Also known as cross entropy loss.

    shifted_logits = x - np.max(x, axis=1, keepdims=True)      # equivalent to exp(x) / exp(max(x))
    Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)  # equivalent to sum(exp(x)) / exp(max(x))
    log_probs = shifted_logits - np.log(Z)                     # log(exp(x) / sum(exp(x))), the log-probabilities; this form saves several exp calls
    # On the seemingly pointless subtraction of max(x) (it leaves the result unchanged but keeps exp from overflowing): https://zhuanlan.zhihu.com/p/92714192
    probs = np.exp(log_probs)                                  # the probabilities
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N             # the loss is -log(probability of the correct class), averaged over the N samples
    dx = probs.copy()                                          # start from the probabilities
    dx[np.arange(N), y] -= 1                                   # d(-log p_y)/dx_j = p_j - 1[j == y], so subtract 1 at the correct class
    dx /= N                                                    # the loss is a mean over N samples, so each sample contributes 1/N
    return loss, dx
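The rule "probabilities minus one at the correct class" can be checked against a numerical gradient. The sketch below (the helper `numerical_grad` and the test scores are assumptions for illustration) also uses logits around 1000, where a plain exp(x) would overflow, to show why the max is subtracted first.

```python
import numpy as np

def softmax_loss(x, y):
    """Softmax (cross entropy) loss and its gradient with respect to the logits x."""
    shifted = x - np.max(x, axis=1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    probs = np.exp(log_probs)
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx

def numerical_grad(f, x, h=1e-5):
    """Central-difference gradient of the scalar function f at x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        fp = f(x)
        x[idx] = old - h
        fm = f(x)
        x[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
    return grad

x = np.array([[1000.0, 1001.0, 999.0],   # np.exp(1000.0) alone would overflow to inf
              [0.5, -1.0, 2.0]])
y = np.array([1, 2])
loss, dx = softmax_loss(x, y)
dx_num = numerical_grad(lambda z: softmax_loss(z, y)[0], x)
print(loss, np.max(np.abs(dx - dx_num)))
```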
batch normalization
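Basic idea
First normalize a minibatch to zero mean and unit variance, then multiply by γ and add β. It is implemented as a layer of its own.
It is usually placed in front of the ReLU layer.
Implementation details
forward:

    # training
    sample_mean = np.mean(x, axis=0)
    sample_var = np.var(x, axis=0)
    x_hat = (x - sample_mean) / (np.sqrt(sample_var + eps))
    out = gamma * x_hat + beta
    cache = (gamma, x, sample_mean, sample_var, eps, x_hat)  # eps is needed again in backward
    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    # at test time
    scale = gamma / (np.sqrt(running_var + eps))
    out = x * scale + (beta - running_mean * scale)
    # same result as normalizing with the running statistics and then applying gamma and beta;
    # folding everything into one scale and one shift per feature just saves a little computation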
backward:
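
    # I probably won't be able to follow this the next time I read it.
    # The gist: when x changes, the batch mean and variance change too, and the derivative has to account for that.
    gamma, x, sample_mean, sample_var, eps, x_hat = cache
    N = x.shape[0]
    dbeta = np.sum(dout, axis=0)              # yes, a sum: accumulate the effect of every sample
    dgamma = np.sum(dout * x_hat, axis=0)
    # mean and var are shared by the whole batch, so their gradients are summed over the N samples
    dy_wrt_dmean = np.sum(-gamma / np.sqrt(sample_var + eps) * dout, axis=0)
    dy_wrt_dvar = np.sum(-0.5 * gamma * np.power(sample_var + eps, -1.5) * (x - sample_mean) * dout, axis=0)
    dmean_wrt_dx = 1.0 / N                    # every sample contributes 1/N; 1.0 avoids integer division under Python 2
    dvar_wrt_dx = 2.0 / N * (x - sample_mean) # from the variance formula
    dy_wrt_dx = gamma / np.sqrt(sample_var + eps) * dout  # the direct path through x_hat
    dx = dy_wrt_dx + dy_wrt_dmean * dmean_wrt_dx + dy_wrt_dvar * dvar_wrt_dx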
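Variance formula: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, which gives $\partial\sigma^2 / \partial x_i = \frac{2}{N}(x_i - \mu)$.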
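A batchnorm backward pass is easy to get subtly wrong, so it is worth comparing against a numerical gradient. The sketch below is one way to do that, assuming training-mode statistics only (no running averages); the function names, the shapes, and the scalar objective L = sum(out * dout) are choices made for this check.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    out = gamma * x_hat + beta
    return out, (gamma, x, mean, var, eps, x_hat)

def batchnorm_backward(dout, cache):
    gamma, x, mean, var, eps, x_hat = cache
    N = x.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = (dout * x_hat).sum(axis=0)
    dx_hat = dout * gamma
    # mean and var are shared across the batch, so their gradients sum over samples
    dvar = np.sum(dx_hat * (x - mean), axis=0) * -0.5 * (var + eps) ** -1.5
    dmean = np.sum(-dx_hat / np.sqrt(var + eps), axis=0)
    dx = dx_hat / np.sqrt(var + eps) + dvar * 2.0 * (x - mean) / N + dmean / N
    return dx, dgamma, dbeta

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3)) * 2.0 + 1.0
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
dout = rng.standard_normal((5, 3))

out, cache = batchnorm_forward_train(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

# central-difference gradient of L = sum(out * dout) with respect to x
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp = x.copy()
    xp[idx] += h
    xm = x.copy()
    xm[idx] -= h
    fp = np.sum(batchnorm_forward_train(xp, gamma, beta)[0] * dout)
    fm = np.sum(batchnorm_forward_train(xm, gamma, beta)[0] * dout)
    dx_num[idx] = (fp - fm) / (2 * h)

print(np.max(np.abs(dx - dx_num)))
```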
dropout
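Basic idea
Original dropout: at train time, set each neuron to 0 with probability p; at test time, multiply the output of the whole layer by (1-p).
Inverted dropout: at train time, set each neuron to 0 with probability p, so a fraction (1-p) of the values survive, then divide all values by (1-p). Scaling up the survivors keeps the expected output the same as if nothing had been dropped. At test time nothing needs to be done.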
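The claim that inverted dropout preserves the expected activation can be checked empirically. This is a sketch under assumptions: the function name `dropout_forward`, the all-ones input, and p = 0.3 are made up for the demonstration, and p is the probability of dropping a unit, as in the description above.

```python
import numpy as np

def dropout_forward(x, p, rng, train=True):
    """Inverted dropout: drop with probability p and rescale by 1/(1-p) at train time."""
    if not train:
        return x, None                        # test time: identity
    mask = (rng.random(x.shape) >= p) / (1 - p)
    return x * mask, mask

rng = np.random.default_rng(0)
x = np.ones((1000, 1000))
p = 0.3

out_train, mask = dropout_forward(x, p, rng, train=True)
out_test, _ = dropout_forward(x, p, rng, train=False)

print(out_train.mean())          # close to 1.0, the mean of x
print((out_train == 0).mean())   # close to p
```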
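Network structure: affine - [batch norm] - relu - [dropout].
Implementation details

    # forward
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)
    out = x * mask
    # backward
    dx = dout * mask

Adding dropout to the [fully connected - batch norm - relu - dropout] block: in forward, apply dropout to the output at the very end; in backward, run the dropout backward first on the incoming upstream gradient, before the relu, batch norm, and affine backward passes.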
convolutional networks
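Basic idea
convolution
The input has shape (N, C, H, W): N is the number of samples, C the number of channels (e.g. RGB), H and W the height and width.
The filters have shape (F, C, HH, WW): F is the number of filters, HH the filter height, WW the filter width.
The output has shape (N, F, H_out, W_out): every sample is convolved with each of the F filters, so the first dimension is N and the second is F. H_out and W_out are the height and width after the convolution.
The convolution also has a bias parameter, a vector of length F, which shifts each output map as a whole.
There are two hyperparameters as well: stride and pad.
H_out and W_out are computed like this:

    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride

The convolution result itself is computed like this (naive version):

    out[:, f, i, j] = np.sum(x_masked * w[f, :, :, :], axis=(1, 2, 3))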
max pooling
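The input has shape (N, C, H, W); the pooling parameters are HH, WW and stride.
Each step looks at an HH*WW window, records the maximum of that window, and then moves on by stride.
The output has shape (N, C, H_out, W_out), where H_out and W_out are computed as follows (same as convolution, without padding):

    H_out = 1 + (H - HH) // stride
    W_out = 1 + (W - WW) // stride

The max itself is computed with np.max(x_masked, axis=(2,3)).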
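As a quick illustration of the window-and-stride logic, here is a naive max pooling forward pass on a 4x4 input with a 2x2 window and stride 2. The function name `max_pool_forward` and the toy input are chosen for this example.

```python
import numpy as np

def max_pool_forward(x, HH, WW, stride):
    """Naive max pooling over (N, C, H, W) input."""
    N, C, H, W = x.shape
    H_out = 1 + (H - HH) // stride
    W_out = 1 + (W - WW) // stride
    out = np.zeros((N, C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            window = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            out[:, :, i, j] = np.max(window, axis=(2, 3))   # max over each window
    return out

x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)   # values 0..15 laid out row by row
out = max_pool_forward(x, HH=2, WW=2, stride=2)
print(out[0, 0])
```

Since the values increase along rows and columns, each 2x2 window's maximum is its bottom-right element: 5, 7, 13 and 15.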
spatial batch normalization
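Let the input be a 4-D tensor (N, C, H, W). In a CNN each feature map is treated as a single feature (one neuron), so for spatial batchnorm the effective mini-batch size is N*H*W, and each feature map has only two learnable parameters: γ and β.
In other words, compute the mean and variance over all neurons of one feature map across all samples, then normalize the neurons of that feature map with them.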
https://blog.csdn.net/hjimce/article/details/50866313
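The per-feature-map statistics can be obtained with ordinary batchnorm code by moving the channel axis last and flattening everything else into the batch dimension. A minimal sketch, assuming γ = 1 and β = 0 and made-up sizes; the axis order (0, 2, 3, 1) is one valid choice, and any permutation that puts C last works as long as the inverse is applied afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 4, 3, 5, 6
x = rng.standard_normal((N, C, H, W)) * 3.0 + 2.0

# (N, C, H, W) -> (N, H, W, C) -> (N*H*W, C): one row per spatial position of each sample
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
mean = x_flat.mean(axis=0)                   # one mean per channel
var = x_flat.var(axis=0)                     # one variance per channel
x_hat_flat = (x_flat - mean) / np.sqrt(var + 1e-5)

# undo the flattening: (N*H*W, C) -> (N, H, W, C) -> (N, C, H, W)
x_hat = x_hat_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)

print(mean, x.mean(axis=(0, 2, 3)))          # the two agree
print(x_hat.mean(axis=(0, 2, 3)), x_hat.var(axis=(0, 2, 3)))
```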
Implementation details
convolution:

    # forward
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    stride, pad = conv_param['stride'], conv_param['pad']
    H_out = 1 + (H + 2 * pad - HH) // stride
    W_out = 1 + (W + 2 * pad - WW) // stride
    out = np.zeros((N, F, H_out, W_out))
    # np.pad pads the edges of an array, i.e. it is the padding operation.
    #   first argument: the array to pad.
    #   second argument: the pad widths, ((before_1, after_1), ..., (before_N, after_N)),
    #       where (before_1, after_1) pads the two ends of axis 1 with before_1 and after_1 values.
    #   last argument: how the padding is filled.
    x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
    for i in range(H_out):
        for j in range(W_out):
            x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            for k in range(F):
                out[:, k, i, j] = np.sum(x_pad_masked * w[k, :, :, :], axis=(1,2,3))
    for k in range(F):
        out[:, k, :, :] += b[k]

    # backward
    x, w, b, conv_param = cache
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    stride, pad = conv_param['stride'], conv_param['pad']
    N, F, H_out, W_out = dout.shape
    x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
    dx = np.zeros_like(x)
    dx_pad = np.zeros_like(x_pad)
    dw = np.zeros_like(w)
    db = np.sum(dout, axis=(0,2,3))
    for i in range(H_out):
        for j in range(W_out):
            x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            for k in range(F):  # compute dw
                # for each filter, the sum accumulates the contribution of the N samples
                dw[k, :, :, :] += np.sum(x_pad_masked * (dout[:, k, i, j])[:, None, None, None], axis=0)
            for n in range(N):  # compute dx_pad
                # for each sample, the sum accumulates the gradient coming from the F filters
                dx_pad[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += np.sum((w[:, :, :, :] * (dout[n, :, i, j])[:, None, None, None]), axis=0)
    dx = dx_pad[:, :, pad:pad+H, pad:pad+W]  # strip the padding; pad:-pad would give an empty slice when pad == 0
max pooling:

    # forward
    N, C, H, W = x.shape
    HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
    H_out = (H - HH) // stride + 1
    W_out = (W - WW) // stride + 1
    out = np.zeros((N, C, H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            out[:, :, i, j] = np.max(x_masked, axis=(2,3))

    # backward
    x, pool_param = cache
    N, C, H, W = x.shape
    HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
    N, C, H_out, W_out = dout.shape
    dx = np.zeros_like(x)
    for i in range(H_out):
        for j in range(W_out):
            x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
            max_x_masked = np.max(x_masked, axis=(2,3))
            # if several values tie for the max, each of them receives the gradient
            temp_binary_mask = (x_masked == (max_x_masked)[:, :, None, None])
            dx[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += temp_binary_mask * (dout[:, :, i, j])[:, :, None, None]
spatial batch normalization:

    # forward
    N, C, H, W = x.shape
    temp_output, cache = batchnorm_forward(x.transpose(0,3,2,1).reshape((N*H*W, C)), gamma, beta, bn_param)
    out = temp_output.reshape(N, W, H, C).transpose(0,3,2,1)
    # backward
PyTorch quick start
First, import a bunch of things:

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torch.autograd import Variable
    from torch.utils.data import DataLoader
    from torch.utils.data import sampler
    import torchvision.datasets as dset
    import torchvision.transforms as T
    import numpy as np
    import timeit

Then, since I have no GPU, set the data type to the CPU one:

    dtype = torch.FloatTensor            # the CPU datatype
    torch.cuda.is_available()            # checks for a GPU; returns True if one is available
    gpu_dtype = torch.cuda.FloatTensor   # the GPU datatype

Next, define a Flatten module that unrolls an input of shape N*C*H*W into shape N*(C*H*W); it is just the np.reshape(x, (x.shape[0], -1)) operation.

    class Flatten(nn.Module):
        def forward(self, x):
            N, C, H, W = x.size()   # read in N, C, H, W
            return x.view(N, -1)    # "flatten" the C * H * W values into a single vector per image
Next, define the model:

    '''
    architecture:
    [conv - ReLU - BatchNorm - MaxPool] - [conv - ReLU - BatchNorm - MaxPool]
    - [affine - BatchNorm - ReLU] - [affine - softmax]
    '''
    model_base = nn.Sequential(
        nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(num_features=16),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(num_features=32),
        nn.MaxPool2d(kernel_size=2, stride=2),
        Flatten(),
        nn.Linear(1152, 200),   # 1152 = 32*6*6 input size
        nn.BatchNorm1d(num_features=200),
        nn.ReLU(inplace=True),
        nn.Linear(200, 10),     # affine layer; the softmax is applied inside CrossEntropyLoss
    )
    model = model_base.type(dtype)   # define the base first, then cast it to the concrete data type
    loss_fn = nn.CrossEntropyLoss().type(dtype)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
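The 1152 in the first linear layer follows from tracing the spatial size through the conv and pool layers. A short sketch of that arithmetic, assuming 32x32 input images and no padding (the default for both layer types); the helper name `out_size` is made up for this example.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv or pool layer along one dimension."""
    return 1 + (size + 2 * pad - kernel) // stride

size = 32                                   # assumed input height and width
size = out_size(size, kernel=5)             # conv 5x5, stride 1 -> 28
size = out_size(size, kernel=2, stride=2)   # max pool 2x2, stride 2 -> 14
size = out_size(size, kernel=3)             # conv 3x3, stride 1 -> 12
size = out_size(size, kernel=2, stride=2)   # max pool 2x2, stride 2 -> 6

flat_features = 32 * size * size            # 32 channels after the second conv
print(size, flat_features)
```

If the input resolution or any kernel size changes, the in_features of the first nn.Linear has to be recomputed the same way.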
cs231n provides functions for training and for checking accuracy, so we copy them over as they are (loader_train and print_every are globals defined elsewhere in the notebook):

    def train(model, loss_fn, optimizer, num_epochs = 1):
        for epoch in range(num_epochs):
            print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
            model.train()
            for t, (x, y) in enumerate(loader_train):
                # Variable is kept from the original code; since PyTorch 0.4 plain tensors work the same way
                x_var = Variable(x.type(dtype))
                y_var = Variable(y.type(dtype).long())

                scores = model(x_var)

                loss = loss_fn(scores, y_var)
                if (t + 1) % print_every == 0:
                    print('t = %d, loss = %.4f' % (t + 1, loss.item()))

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    def check_accuracy(model, loader):
        if loader.dataset.train:
            print('Checking accuracy on validation set')
        else:
            print('Checking accuracy on test set')
        num_correct = 0
        num_samples = 0
        model.eval()  # Put the model in test mode (the opposite of model.train(), essentially)
        for x, y in loader:
            with torch.no_grad():
                x_var = Variable(x.type(dtype))

            scores = model(x_var)
            _, preds = scores.data.cpu().max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

Then we start training:

    train(model, loss_fn, optimizer, num_epochs=10)
    check_accuracy(model, loader_val)        # validation
    check_accuracy(best_model, loader_test)  # test; best_model is whichever model was selected on validation (e.g. best_model = model)
TensorFlow quick start
First, import a bunch of things:

    import tensorflow.compat.v1 as tf
    tf.disable_eager_execution()   # tf is already the compat.v1 module here
    import numpy as np
    import math
    import timeit
    import matplotlib.pyplot as plt
    %matplotlib inline
Next, declare X and y with placeholders.

    X = tf.placeholder(tf.float32, [None, 32, 32, 3])
    y = tf.placeholder(tf.int64, [None])
    is_training = tf.placeholder(tf.bool)   # batchnorm behaves differently in train and test, so this has to be tracked

Declare the model:

    def my_model(X, y, is_training):
        # Conv-Relu-BN
        conv1act = tf.layers.conv2d(inputs=X, filters=32, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
        bn1act = tf.layers.batch_normalization(inputs=conv1act, training=is_training)
        # Conv-Relu-BN
        conv2act = tf.layers.conv2d(inputs=bn1act, filters=64, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
        bn2act = tf.layers.batch_normalization(inputs=conv2act, training=is_training)
        # Maxpool
        maxpool1act = tf.layers.max_pooling2d(inputs=bn2act, pool_size=2, strides=2)
        # Flatten
        flatten1 = tf.reshape(maxpool1act, [-1, 16384])   # 16384 = 16*16*64
        # FC-Relu-BN
        fc1 = tf.layers.dense(inputs=flatten1, units=1024, activation=tf.nn.relu)
        bn3act = tf.layers.batch_normalization(inputs=fc1, training=is_training)
        # Output FC
        y_out = tf.layers.dense(inputs=bn3act, units=10, activation=None)
        return y_out
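Next, declare the loss and the optimizer.

    # clear old variables
    # note: this has to run before the placeholders X, y and is_training are created;
    # placeholders created earlier belong to the discarded graph and cannot be used with the new one
    tf.reset_default_graph()
    y_out = my_model(X, y, is_training)
    mean_loss = tf.losses.softmax_cross_entropy(logits=y_out, onehot_labels=tf.one_hot(y, 10))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

    # batch normalization in tensorflow requires this extra dependency:
    # the moving-average update ops must run together with every training step
    extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(extra_update_ops):
        train_step = optimizer.minimize(mean_loss)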
cs231n provides the training function, so we paste it in directly:

    def run_model(session, predict, loss_val, Xd, yd,
                  epochs=1, batch_size=64, print_every=100,
                  training=None, plot_losses=False):
        # have tensorflow compute accuracy
        correct_prediction = tf.equal(tf.argmax(predict,1), y)
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

        # shuffle indices
        train_indicies = np.arange(Xd.shape[0])
        np.random.shuffle(train_indicies)

        training_now = training is not None

        # setting up variables we want to compute (and optimizing)
        # if we have a training function, add that to things we compute
        variables = [mean_loss,correct_prediction,accuracy]
        if training_now:
            variables[-1] = training

        # counter
        iter_cnt = 0
        for e in range(epochs):
            # keep track of losses and accuracy
            correct = 0
            losses = []
            # make sure we iterate over the dataset once
            for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
                # generate indices for the batch
                start_idx = (i*batch_size)%Xd.shape[0]
                idx = train_indicies[start_idx:start_idx+batch_size]

                # create a feed dictionary for this batch
                feed_dict = {X: Xd[idx,:],
                             y: yd[idx],
                             is_training: training_now }
                # get batch size
                actual_batch_size = yd[idx].shape[0]

                # have tensorflow compute loss and correct predictions
                # and (if given) perform a training step
                loss, corr, _ = session.run(variables,feed_dict=feed_dict)

                # aggregate performance stats
                losses.append(loss*actual_batch_size)
                correct += np.sum(corr)

                # print every now and then
                if training_now and (iter_cnt % print_every) == 0:
                    print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
                          .format(iter_cnt,loss,np.sum(corr)/actual_batch_size))
                iter_cnt += 1
            total_correct = correct/Xd.shape[0]
            total_loss = np.sum(losses)/Xd.shape[0]
            print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
                  .format(total_loss,total_correct,e+1))
            if plot_losses:
                plt.plot(losses)
                plt.grid(True)
                plt.title('Epoch {} Loss'.format(e+1))
                plt.xlabel('minibatch number')
                plt.ylabel('minibatch loss')
                plt.show()
        return total_loss,total_correct
Let's start training:

    sess = tf.Session()   # the session encapsulates the state of the compute graph and the control over it
    sess.run(tf.global_variables_initializer())
    print('Training')
    run_model(sess, y_out, mean_loss, X_train, y_train, 10, 64, 100, train_step, True)
    print('Validation')
    run_model(sess, y_out, mean_loss, X_val, y_val, 1, 64)

Run it on the test set:

    print('Test')
    run_model(sess, y_out, mean_loss, X_test, y_test, 1, 64)
Author: MoonOut
Link: https://www.cnblogs.com/moonout/p/15369592.html
License: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 2.5 China Mainland license.