DL Basics | cs231n assignment 2
cs231n assignment 2
20210913 - 20211005.
fully-connected nets
Basic idea
Wrap each kind of layer up and you get modular programming.
Wrap a forward: its input is the computational-graph node's input; its output is the node's output plus whatever information needs to be cached.
Wrap a backward: its input is the upstream derivative, i.e. the derivative of the node's output; its output is the derivative of each of the node's inputs.
In backward, the chain rule means the partial derivatives can be computed mechanically: line up the dimensions and matrix-multiply.
Implementation details
x_rsp = x.reshape(x.shape[0], -1) # N*d1*d2*... -> N*D, one row per sample
A = B.dot(C) # matrix multiplication
dx = dx.reshape(x.shape) # reshape me back into your shape
out = x * (x >= 0) # ReLU: keep the values >= 0, in terse numpy
dx = (x > 0) * dout # backprop through ReLU
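Assembled into the forward/backward interface described above, the affine layer reads roughly as below (a sketch in the cs231n layers.py style; the fragments above are its pieces):
def affine_forward(x, w, b):
    # input of shape (N, d1, ..., dk) is flattened to (N, D); w is (D, M), b is (M,)
    out = x.reshape(x.shape[0], -1).dot(w) + b
    cache = (x, w, b)  # stash what backward will need
    return out, cache

def affine_backward(dout, cache):
    x, w, b = cache
    x_rsp = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)  # back into x's original shape
    dw = x_rsp.T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db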
On the dimensions of W in the fully-connected layers:
layer_input_dim = input_dim
for i, hd in enumerate(hidden_dims):
    self.params['W%d'%(i+1)] = weight_scale * np.random.randn(layer_input_dim, hd)
    self.params['b%d'%(i+1)] = np.zeros(hd)
    if self.use_batchnorm:
        self.params['gamma%d'%(i+1)] = np.ones(hd)
        self.params['beta%d'%(i+1)] = np.zeros(hd)
    layer_input_dim = hd
self.params['W%d'%(self.num_layers)] = weight_scale * np.random.randn(layer_input_dim, num_classes)
self.params['b%d'%(self.num_layers)] = np.zeros(num_classes)
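Each W therefore bridges the previous layer's width to the next layer's; for example:
# input_dim=3072, hidden_dims=[100, 50], num_classes=10 gives:
#   W1: (3072, 100)   b1: (100,)
#   W2: (100, 50)     b2: (50,)
#   W3: (50, 10)      b3: (10,)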
Stochastic gradient descent with momentum:
v = config['momentum'] * v - config['learning_rate'] * dw
# decay the velocity (by e.g. 0.9), then add the acceleration direction
next_w = w + v # update w with the velocity
config['velocity'] = v # store the updated velocity
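cs231n wraps each update rule as a function taking (w, dw, config); the fragments above complete to something like this sketch, with the usual default hyperparameters:
def sgd_momentum(w, dw, config=None):
    # returns the updated weights and the config dict carrying the velocity
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    config['velocity'] = v
    return next_w, config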
RMSProp:
config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx**2)
# cache: a weighted average of the old cache and dx**2, with decay_rate as the weight
next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
# next_x: a step of learning_rate in the direction of -dx / (sqrt(cache) + a small epsilon against division by zero)
Adam:
config['t'] += 1
# t: incremented on every update of W; used to rein in how mb and vb grow
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
# m: a weighted average of the old m and dx, with beta1 as the weight
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx**2)
# v: a weighted average of the old v and dx**2, with beta2 as the weight
mb = config['m'] / (1 - config['beta1']**config['t'])
# mb: m divided by 1 - beta1^t, which inflates it a little. As t grows, beta1**t shrinks,
# 1 - beta1**t approaches 1, and dividing by it inflates less and less, so the boost to mb fades over time.
vb = config['v'] / (1 - config['beta2']**config['t'])
# vb: v divided by 1 - beta2^t, inflated a little; same story as above.
next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
# next_x: a step of learning_rate in the direction of -mb / (sqrt(vb) + a small epsilon)
So what is Adam all about?
- For stochastic gradient descent we pick a descent direction, then take a step of size learning rate.
- Which direction? The negative of mb / (sqrt(vb) + epsilon).
- What is mb? It is m divided by (1 - β1^t).
- Dividing by (1-β1^t) is the bias correction: as t accumulates, β1^t shrinks, 1-β1^t grows, and the inflation from dividing by it gradually fades away.
- What is m? It is the momentum term, updated as a weighted average of the old m and the current dx.
- And vb? It is v divided by (1 - β2^t).
- Dividing by (1-β2^t) corrects vb the same way. (Though vb ends up in the denominator of the step, so the two corrections pull in opposite directions, which feels paradoxical.)
- What is v? It is the RMSProp trick, updated as a weighted average of the old v and the current dx².
- So Adam combines momentum and RMSProp: it steps along the momentum direction, divides by the root of the averaged squared gradients, and on top of that reins both in with the curious division by (1-β^t).
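Why the bias correction is needed: m starts at 0, so early on it badly underestimates the running average of the gradient, and dividing by (1 - β1^t) cancels that exactly. A tiny check with a constant gradient:
beta1, m, dx = 0.9, 0.0, 1.0           # pretend the gradient is always exactly 1
m = beta1 * m + (1 - beta1) * dx       # after step 1: m = 0.1, far below the true average of 1.0
mb = m / (1 - beta1**1)                # bias-corrected: 0.1 / 0.1 = 1.0, the true average
m = beta1 * m + (1 - beta1) * dx       # after step 2: m = 0.19
mb = m / (1 - beta1**2)                # 0.19 / 0.19 = 1.0 again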
Review: multiclass SVM loss and softmax loss
multiclass svm loss & derivative
It also goes by the name hinge loss.
N = x.shape[0]
correct_class_scores = x[np.arange(N), y] # scores of the correct classes, a length-N vector
margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
margins[np.arange(N), y] = 0 # only the wrong classes count toward the loss
loss = np.sum(margins) / N # sum the loss over the N samples, then average
num_pos = np.sum(margins > 0, axis=1)
dx = np.zeros_like(x) # all-zero matrix with x's shape
dx[margins > 0] = 1 # the loss grows when a wrong class's score goes up
dx[np.arange(N), y] -= num_pos
# each violated margin also means the loss grows when the correct class's score goes down
dx /= N # average the N samples' effects
return loss, dx
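Worked through on the classic one-sample example from the course notes (scores 3.2, 5.1, -1.7, correct class 0):
x = np.array([[3.2, 5.1, -1.7]])
y = np.array([0])
# margins: max(0, 5.1 - 3.2 + 1) = 2.9 and max(0, -1.7 - 3.2 + 1) = 0, so loss = 2.9
# only one margin is violated, so dx = [[-1.0, 1.0, 0.0]]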
softmax loss & derivative
Also known as cross-entropy loss.
shifted_logits = x - np.max(x, axis=1, keepdims=True)
# exp of this is exp(x)/exp(max(x))
Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
# i.e. the sum of exp(x)/exp(max(x)),
# which is sum(exp(x)) divided by exp(max(x))
log_probs = shifted_logits - np.log(Z)
# i.e. the log of exp(x)/sum(exp(x)), the log of the probability, computed with far fewer exps
# on the seemingly useless "subtract max(x)": https://zhuanlan.zhihu.com/p/92714192
probs = np.exp(log_probs) # the probabilities
N = x.shape[0]
loss = -np.sum(log_probs[np.arange(N), y]) / N
# the loss is -log(probability of the correct class), averaged over the N samples
dx = probs.copy() # start with dx = the computed probabilities
dx[np.arange(N), y] -= 1 # then subtract 1 at every correct class
# because d(-log p_y)/d(logit_j) = p_j - 1{j == y}
dx /= N # finally average over the N samples, since each contributes 1/N of the loss
return loss, dx
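A quick way to trust either dx is a centered-difference numeric gradient. A minimal sketch, assuming the fragments above are wrapped up as softmax_loss(x, y) -> (loss, dx) as in the assignment:
def numeric_gradient(f, x, h=1e-5):
    # element-by-element centered difference of a scalar function f
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h
        fp = f(x)
        x[ix] = old - h
        fm = f(x)
        x[ix] = old
        grad[ix] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

loss, dx = softmax_loss(x, y)
num_dx = numeric_gradient(lambda x: softmax_loss(x, y)[0], x)
print(np.max(np.abs(num_dx - dx)))  # should be around 1e-8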
batch normalization
Basic idea
First normalize a minibatch of data to zero mean and unit variance, then multiply by γ and add β. This is its own special layer.
It is usually placed right before a ReLU layer.
Implementation details
forward:
sample_mean = np.mean(x, axis=0)
sample_var = np.var(x, axis=0)
x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
out = gamma * x_hat + beta
cache = (gamma, x, sample_mean, sample_var, eps, x_hat) # backward needs eps too
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var
# at test time
scale = gamma / np.sqrt(running_var + eps)
out = x * scale + (beta - running_mean * scale)
# mathematically the same; this form just saves a little compute by precomputing the scale
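The two test-time forms really are the same expression rearranged; a quick check:
out_a = gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta
scale = gamma / np.sqrt(running_var + eps)
out_b = x * scale + (beta - running_mean * scale)
print(np.allclose(out_a, out_b))  # True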
backward:
# The gist: if x changes, the mean and the variance change too, and the
# derivative must follow all three paths (through x_hat, the mean, and the variance).
gamma, x, sample_mean, sample_var, eps, x_hat = cache
N = x.shape[0]
dbeta = np.sum(dout, axis=0) # yes, a sum: accumulate every sample's contribution
dgamma = np.sum(dout * x_hat, axis=0)
dx_hat = dout * gamma # path 1: directly through x_hat
dvar = np.sum(dx_hat * (x - sample_mean), axis=0) * -0.5 * np.power(sample_var + eps, -1.5)
dmean = np.sum(dx_hat, axis=0) * -1.0 / np.sqrt(sample_var + eps)
# (the mean's extra path through the variance vanishes, since sum(x - mean) == 0)
dx = dx_hat / np.sqrt(sample_var + eps) + dvar * 2.0 * (x - sample_mean) / N + dmean / N
# each x_i contributes 1/N to the mean and 2(x_i - mean)/N to the variance
The variance formula: var = (1/N) * sum_i (x_i - mean)^2, hence d(var)/d(x_i) = 2(x_i - mean)/N, which is exactly the factor used in dx above.
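For the record, once dgamma/dbeta are out of the way, dx collapses to a compact closed form (this is what the assignment's batchnorm_backward_alt asks for):
dx_hat = dout * gamma
dx = (1.0 / N) / np.sqrt(sample_var + eps) * (
    N * dx_hat
    - np.sum(dx_hat, axis=0)
    - x_hat * np.sum(dx_hat * x_hat, axis=0)
)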
dropout
Basic idea
Original dropout: at train time, zero each neuron with probability p; at test time, multiply the whole layer's output by (1-p).
Inverted dropout: at train time, zero each neuron with probability p, keeping a (1-p) fraction of the values, then divide everything by (1-p) (like taking an average), amplifying the survivors' influence so that, in expectation, nothing changed. At test time, do nothing at all.
Network structure: affine - [batch norm] - relu - [dropout].
Implementation details
# forward
mask = (np.random.rand(*x.shape) >= p) / (1-p)
out = x * mask
# backward
dx = dout * mask
Adding dropout to the [fully connected - batch norm - relu - dropout] block: in forward, apply dropout to the output at the very end; in backward, push the upstream gradient through a dropout backward first.
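With the train/test split made explicit, the whole layer reads roughly as below (a sketch; p here is the probability of dropping a unit, as in the notes above):
def dropout_forward(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if mode == 'train':
        mask = (np.random.rand(*x.shape) >= p) / (1 - p)  # inverted dropout
        out = x * mask
    else:
        mask = None
        out = x  # inverted dropout is a no-op at test time
    return out, (dropout_param, mask)

def dropout_backward(dout, cache):
    dropout_param, mask = cache
    if dropout_param['mode'] == 'train':
        return dout * mask
    return dout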
convolutional networks
Basic idea
convolution
The input has shape (N, C, H, W): N is the number of samples, C the number of channels (e.g. RGB), H and W the height and width.
The filters have shape (F, C, HH, WW): F is the number of filters, HH the filter height, WW the filter width.
The output has shape (N, F, H_out, W_out): each sample is convolved with all F filters, so the first dimension is N and the second is F. H_out and W_out are the height and width after convolution.
Convolution also has a bias parameter, a vector of length F, which shifts each output map as a whole.
There are also two hyperparameters: the stride and the pad.
H_out and W_out are computed as:
H_out = 1 + (H + 2 * pad - HH) // stride
W_out = 1 + (W + 2 * pad - WW) // stride
# e.g. H=32, pad=1, HH=7, stride=2 gives H_out = 1 + (32 + 2 - 7) // 2 = 14
To compute the convolution result, write it like this (naive version):
out[:, f, i, j] = np.sum(x_masked * w[f,:,:,:], axis=(1,2,3))
max pooling
The input has shape (N, C, H, W); pooling's parameters are HH, WW and the stride.
Each step looks at an HH*WW square region, records its maximum, then moves by the stride.
The output shape is (N, C, H_out, W_out), where H_out and W_out are computed as for convolution:
H_out = 1 + (H - HH) // stride
W_out = 1 + (W - WW) // stride
To compute the max, use np.max(x_masked, axis=(2,3)).
spatial batch normalization
Let the input be a 4-D array (N, C, H, W). In a CNN each feature map is treated as one feature (one neuron), so when we use spatial batchnorm the effective mini-batch size becomes N*H*W, and each feature map gets only two learnable parameters: γ and β.
In other words: take the mean and variance over ALL neurons of a given feature map across all samples, then normalize that feature map's neurons with them.
https://blog.csdn.net/hjimce/article/details/50866313
Implementation details
convolution:
# forward
N, C, H, W = x.shape
F, _, HH, WW = w.shape
stride, pad = conv_param['stride'], conv_param['pad']
H_out = 1 + (H + 2 * pad - HH) // stride
W_out = 1 + (W + 2 * pad - WW) // stride
out = np.zeros((N, F, H_out, W_out))
x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
"""
np.pad: pads the edges of an array, i.e. exactly a padding operation.
First argument: the array to pad.
Second argument: the pad widths, ((before_1, after_1), ..., (before_N, after_N)),
where (before_i, after_i) pads axis i with before_i values in front and after_i behind.
Last argument: the padding mode.
"""
for i in range(H_out):
    for j in range(W_out):
        x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        for k in range(F):
            out[:, k, i, j] = np.sum(x_pad_masked * w[k, :, :, :], axis=(1,2,3))
for k in range(F):
    out[:, k, :, :] += b[k]
# backward
x, w, b, conv_param = cache
N, C, H, W = x.shape
F, _, HH, WW = w.shape
stride, pad = conv_param['stride'], conv_param['pad']
N, F, H_out, W_out = dout.shape
x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
dx_pad = np.zeros_like(x_pad)
dw = np.zeros_like(w)
db = np.sum(dout, axis=(0,2,3))
for i in range(H_out):
    for j in range(W_out):
        x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        for k in range(F):  # compute dw
            dw[k,:,:,:] += np.sum(x_pad_masked * (dout[:,k,i,j])[:, None, None, None], axis=0)
            # per filter, the sum accumulates the N samples' contributions
        for n in range(N):  # compute dx_pad
            dx_pad[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += np.sum((w[:,:,:,:] * (dout[n, :, i, j])[:,None,None,None]), axis=0)
            # per sample, the sum accumulates the gradients from the F filters
dx = dx_pad[:,:,pad:-pad,pad:-pad]  # strip the padding again (assumes pad > 0)
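A cheap shape sanity check, assuming the fragments above live in a conv_forward_naive(x, w, b, conv_param) as in the assignment scaffold:
x = np.random.randn(2, 3, 8, 8)
w = np.random.randn(4, 3, 3, 3)
b = np.random.randn(4)
out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})
print(out.shape)  # (2, 4, 8, 8), since 1 + (8 + 2*1 - 3) // 1 == 8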
max pooling:
# forward
N, C, H, W = x.shape
HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
H_out = (H - HH) // stride + 1
W_out = (W - WW) // stride + 1
out = np.zeros((N, C, H_out, W_out))
for i in range(H_out):
    for j in range(W_out):
        x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        out[:,:,i,j] = np.max(x_masked, axis=(2,3))
# backward
x, pool_param = cache
N, C, H, W = x.shape
HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
N, C, H_out, W_out = dout.shape
dx = np.zeros_like(x)
for i in range(H_out):
    for j in range(W_out):
        x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        max_x_masked = np.max(x_masked, axis=(2,3))
        temp_binary_mask = (x_masked == (max_x_masked)[:,:,None,None])
        # if several entries tie for the max, they all inherit the gradient
        dx[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += temp_binary_mask * (dout[:,:,i,j])[:,:,None,None]
spatial batch normalization:
# forward
N, C, H, W = x.shape
temp_output, cache = batchnorm_forward(x.transpose(0,3,2,1).reshape((N*H*W, C)), gamma, beta, bn_param)
out = temp_output.reshape(N, W, H, C).transpose(0,3,2,1)
# backward: the mirror image — reshape, reuse the vanilla batchnorm backward, reshape back
N, C, H, W = dout.shape
dx_temp, dgamma, dbeta = batchnorm_backward(dout.transpose(0,3,2,1).reshape((N*H*W, C)), cache)
dx = dx_temp.reshape(N, W, H, C).transpose(0,3,2,1)
PyTorch quick start
First, import a bunch of things:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import sampler
import torchvision.datasets as dset
import torchvision.transforms as T
import numpy as np
import timeit
Then, since I have no GPU, define the data type as the CPU one:
dtype = torch.FloatTensor # the CPU datatype
torch.cuda.is_available() # check whether a GPU exists; returns True if so
gpu_dtype = torch.cuda.FloatTensor # the GPU datatype
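The train() and check_accuracy() functions below also read the globals loader_train, loader_val and print_every, which the notebook builds from CIFAR-10. A sketch using the stock SubsetRandomSampler (the original notebook uses its own ChunkSampler; the path and split sizes are assumptions):
NUM_TRAIN, NUM_VAL = 49000, 1000
print_every = 100

cifar10_train = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
                             transform=T.ToTensor())
loader_train = DataLoader(cifar10_train, batch_size=64,
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))
cifar10_val = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
                           transform=T.ToTensor())
loader_val = DataLoader(cifar10_val, batch_size=64,
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, NUM_TRAIN + NUM_VAL)))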
Then we define a Flatten, which unrolls an input of shape N*C*H*W into shape N*(C*H*W): just an np.reshape(x, (x.shape[0], -1)) operation.
class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size()  # read in N, C, H, W
        return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image
Next, we define the model:
'''
architecture:
[conv - ReLU - BatchNorm - MaxPool] -
[conv - ReLU - BatchNorm - MaxPool] -
[affine - BatchNorm - ReLU] -
[affine - softmax]
'''
model_base = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(num_features=16),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(num_features=32),
    nn.MaxPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.Linear(1152, 200),  # 1152 = 32*6*6 input size
    nn.BatchNorm1d(num_features=200),
    nn.ReLU(inplace=True),
    nn.Linear(200, 10),  # affine layer
)
model = model_base.type(dtype) # define the base first, then stamp the concrete data type onto it
loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
cs231n provides the training and accuracy-checking functions; we copy them over verbatim:
def train(model, loss_fn, optimizer, num_epochs=1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, (x, y) in enumerate(loader_train):
            x_var = Variable(x.type(dtype))
            y_var = Variable(y.type(dtype).long())
            scores = model(x_var)
            loss = loss_fn(scores, y_var)
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')
    num_correct = 0
    num_samples = 0
    model.eval()  # Put the model in test mode (the opposite of model.train(), essentially)
    for x, y in loader:
        with torch.no_grad():
            x_var = Variable(x.type(dtype))
            scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        num_correct += (preds == y).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
Then we start training:
train(model, loss_fn, optimizer, num_epochs=10)
check_accuracy(model, loader_val) # validation
check_accuracy(best_model, loader_test) # test
TensorFlow quick start
First, import a bunch of things:
import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
import numpy as np
import math
import timeit
import matplotlib.pyplot as plt
%matplotlib inline
Next, declare X and y with placeholders.
X = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool) # batchnorm behaves differently at train and test time, so record which one we are in
Declare the model:
def my_model(X, y, is_training):
    # Conv-Relu-BN
    conv1act = tf.layers.conv2d(inputs=X, filters=32, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
    bn1act = tf.layers.batch_normalization(inputs=conv1act, training=is_training)
    # Conv-Relu-BN
    conv2act = tf.layers.conv2d(inputs=bn1act, filters=64, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
    bn2act = tf.layers.batch_normalization(inputs=conv2act, training=is_training)
    # Maxpool
    maxpool1act = tf.layers.max_pooling2d(inputs=bn2act, pool_size=2, strides=2)
    # Flatten
    flatten1 = tf.reshape(maxpool1act, [-1, 16384])  # 16384 = 16*16*64 after the 2x2 pool
    # FC-Relu-BN
    fc1 = tf.layers.dense(inputs=flatten1, units=1024, activation=tf.nn.relu)
    bn3act = tf.layers.batch_normalization(inputs=fc1, training=is_training)
    # Output FC
    y_out = tf.layers.dense(inputs=bn3act, units=10, activation=None)
    return y_out
Next, declare the loss and the optimizer.
# clear old variables
tf.reset_default_graph()
y_out = my_model(X,y,is_training)
mean_loss = tf.losses.softmax_cross_entropy(logits=y_out, onehot_labels=tf.one_hot(y,10))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
# batch normalization in tensorflow requires this extra dependency:
# UPDATE_OPS holds the ops that refresh batchnorm's running mean/variance,
# and control_dependencies makes them run with every train step
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_step = optimizer.minimize(mean_loss)
cs231n provides the training function; paste it in:
def run_model(session, predict, loss_val, Xd, yd,
              epochs=1, batch_size=64, print_every=100,
              training=None, plot_losses=False):
    # have tensorflow compute accuracy
    correct_prediction = tf.equal(tf.argmax(predict,1), y)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    # shuffle indices
    train_indicies = np.arange(Xd.shape[0])
    np.random.shuffle(train_indicies)
    training_now = training is not None
    # setting up variables we want to compute (and optimizing)
    # if we have a training function, add that to things we compute
    variables = [mean_loss, correct_prediction, accuracy]
    if training_now:
        variables[-1] = training
    # counter
    iter_cnt = 0
    for e in range(epochs):
        # keep track of losses and accuracy
        correct = 0
        losses = []
        # make sure we iterate over the dataset once
        for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
            # generate indices for the batch
            start_idx = (i*batch_size)%Xd.shape[0]
            idx = train_indicies[start_idx:start_idx+batch_size]
            # create a feed dictionary for this batch
            feed_dict = {X: Xd[idx,:],
                         y: yd[idx],
                         is_training: training_now}
            # get batch size
            actual_batch_size = yd[idx].shape[0]
            # have tensorflow compute loss and correct predictions
            # and (if given) perform a training step
            loss, corr, _ = session.run(variables, feed_dict=feed_dict)
            # aggregate performance stats
            losses.append(loss*actual_batch_size)
            correct += np.sum(corr)
            # print every now and then
            if training_now and (iter_cnt % print_every) == 0:
                print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
                      .format(iter_cnt, loss, np.sum(corr)/actual_batch_size))
            iter_cnt += 1
        total_correct = correct/Xd.shape[0]
        total_loss = np.sum(losses)/Xd.shape[0]
        print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
              .format(total_loss, total_correct, e+1))
        if plot_losses:
            plt.plot(losses)
            plt.grid(True)
            plt.title('Epoch {} Loss'.format(e+1))
            plt.xlabel('minibatch number')
            plt.ylabel('minibatch loss')
            plt.show()
    return total_loss, total_correct
Let's start training:
sess = tf.Session() # a Session wraps the state of the compute graph and its execution
sess.run(tf.global_variables_initializer())
print('Training')
run_model(sess,y_out,mean_loss,X_train,y_train,10,64,100,train_step,True)
print('Validation')
run_model(sess,y_out,mean_loss,X_val,y_val,1,64)
And test:
print('Test')
run_model(sess,y_out,mean_loss,X_test,y_test,1,64)