cs231n assignment 2

20210913 - 20211005.

fully-connected nets

Basic idea

Wrap each kind of layer as a module, and the whole network can be built by modular programming.

Wrap a forward function: its input is the input of a computational-graph node, and it returns the node's output plus whatever information needs to be cached.

Wrap a backward function: its input is the upstream derivative, i.e. the derivative with respect to the node's output, and it returns the derivatives with respect to each of the node's inputs.

In backward, the chain rule plus shape matching lets you compute the partial derivatives almost mechanically with matrix multiplications.

Programming details

x_rsp = x.reshape(x.shape[0], -1) # N*d1*d2*... -> N*D, one sample per row
A = B.dot(C) # matrix multiplication
dx = dx.reshape(x.shape) # reshape back into the shape of x
out = x * (x >= 0) # ReLU: keep the values >= 0, concise numpy idiom
dx = (x > 0) * dout # ReLU backprop: pass the gradient only where x > 0
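As a concrete instance of this wrapper pattern, here is a minimal affine (fully-connected) layer sketch written in the same style as the assignment's layer interface (out/cache and dx/dw/db conventions as above):

import numpy as np

def affine_forward(x, w, b):
    x_rsp = x.reshape(x.shape[0], -1)  # (N, d1, ..., dk) -> (N, D)
    out = x_rsp.dot(w) + b             # w: (D, M), b: (M,)
    cache = (x, w, b)                  # keep what backward will need
    return out, cache

def affine_backward(dout, cache):
    x, w, b = cache
    x_rsp = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)  # back to the original shape of x
    dw = x_rsp.T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db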

On the shapes of the W parameters in the fully-connected layers:

layer_input_dim = input_dim
for i, hd in enumerate(hidden_dims):
    self.params['W%d'%(i+1)] = weight_scale * np.random.randn(layer_input_dim, hd)
    self.params['b%d'%(i+1)] = np.zeros(hd)
    if self.use_batchnorm:
        self.params['gamma%d'%(i+1)] = np.ones(hd)
        self.params['beta%d'%(i+1)] = np.zeros(hd)
    layer_input_dim = hd
self.params['W%d'%(self.num_layers)] = weight_scale * np.random.randn(layer_input_dim, num_classes)
self.params['b%d'%(self.num_layers)] = np.zeros(num_classes)

Stochastic gradient descent with momentum:

v = config['momentum'] * v - config['learning_rate'] * dw
# decay the velocity (e.g. by 0.9), then add a step in the negative gradient direction
next_w = w + v # update w with the velocity
config['velocity'] = v # store the updated velocity
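In formulas, with ρ = momentum and α = learning_rate, this is:

\[ v_{t+1}=\rho\,v_t-\alpha\,\nabla_w L,\qquad w_{t+1}=w_t+v_{t+1} \]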

RMSProp:

config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx**2)
# cache: weighted average (with weight decay_rate) of the old cache and the squared gradient dx**2
next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
# next_x: take a step of size learning_rate in the direction of -dx / (sqrt(cache) + epsilon); epsilon guards against division by zero

Adam:

config['t'] += 1
# t: incremented on every update; used below for the bias correction of m and v
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
# m: weighted average (with weight beta1) of the old m and the gradient dx
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx**2)
# v: weighted average (with weight beta2) of the old v and the squared gradient dx**2
mb = config['m'] / (1 - config['beta1']**config['t'])
# mb: bias-corrected m. m starts at 0 and is biased toward 0 in the first steps; dividing by (1 - beta1**t) inflates it to compensate. As t grows, beta1**t -> 0, the denominator -> 1, and the correction fades away.
vb = config['v'] / (1 - config['beta2']**config['t'])
# vb: bias-corrected v, same idea as above
next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
# next_x: take a step of size learning_rate in the direction of -mb / (sqrt(vb) + epsilon)

So what is Adam doing? Roughly:

  • In stochastic gradient descent we pick a descent direction and take a step of size learning_rate.
  • Which direction? The direction of mb / (sqrt(vb) + epsilon).
  • What is mb? It is m divided by (1 - β1^t).
    • The division by (1 - β1^t) is a bias correction: m is initialized to 0 and therefore biased toward 0 in the first steps; the denominator is small at first (so the correction inflates m) and approaches 1 as t grows (so the correction fades away).
    • And m itself is just momentum: a weighted average of the old m and the current gradient dx.
  • What is vb? It is v divided by (1 - β2^t).
    • Same bias correction as above, this time for v, which ends up inside the denominator of the step.
    • And v is the RMSProp trick: a weighted average of the old v and the squared gradient dx².
  • So Adam combines momentum and RMSProp: it steps along the momentum direction, rescales it by the square root of the running average of squared gradients, and bias-corrects both with the (1 - β^t) terms. The full update is summarized below.
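For reference, the standard Adam update in one place (matching the code above, with g_t = dx and α = learning_rate):

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\
\hat m_t &= \frac{m_t}{1-\beta_1^t},\qquad \hat v_t=\frac{v_t}{1-\beta_2^t} \\
x_{t+1} &= x_t-\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}
\end{aligned}
\]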

Review of the multiclass SVM loss and the softmax loss

multiclass svm loss & derivative

Also known as the hinge loss.
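For one sample with scores s and correct class y_i (margin 1, as in the code below):

\[ L_i=\sum_{j\neq y_i}\max(0,\;s_j-s_{y_i}+1) \]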

N = x.shape[0]
correct_class_scores = x[np.arange(N), y] # scores of the correct classes, a length-N vector
margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
margins[np.arange(N), y] = 0 # only the incorrect classes contribute to the loss
loss = np.sum(margins) / N # sum the loss over the N samples, then average
num_pos = np.sum(margins > 0, axis=1)
dx = np.zeros_like(x) # all-zero matrix with the shape of x
dx[margins > 0] = 1 # the loss grows when the score of an incorrect class with positive margin grows
dx[np.arange(N), y] -= num_pos
# each incorrect class with positive margin also pushes the loss up when the correct-class score decreases, hence -num_pos at the correct class
dx /= N # average the effect over the N samples
return loss, dx

softmax loss & derivative

Also known as the cross-entropy loss.
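For one sample with logits f and correct class y_i:

\[ L_i=-\log\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \]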

shifted_logits = x - np.max(x, axis=1, keepdims=True)
# in exp space this is exp(x) / exp(max(x))
Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
# summing over classes gives sum(exp(x)) / exp(max(x))
log_probs = shifted_logits - np.log(Z)
# equals log( exp(x) / sum(exp(x)) ), i.e. the log of the probabilities, without ever taking log(exp(...)) explicitly
# on the seemingly useless "subtract max(x)" (numerical stability): https://zhuanlan.zhihu.com/p/92714192
probs = np.exp(log_probs) # the probabilities
N = x.shape[0]
loss = -np.sum(log_probs[np.arange(N), y]) / N
# the loss is -log(probability of the correct class), averaged over the N samples
dx = probs.copy() # start from the computed probabilities
dx[np.arange(N), y] -= 1
# the gradient of the softmax loss w.r.t. the logits is probs - one_hot(y), so subtract 1 at each correct class
dx /= N # average over the N samples, since each sample contributes 1/N of the loss
return loss, dx

batch normalization

Basic idea

Normalize the data in a minibatch to zero mean and unit variance, then scale by γ and shift by β. This is implemented as a layer of its own.

It is usually placed right before the ReLU layer.
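In formulas, per feature dimension of the minibatch:

\[ \hat x=\frac{x-\mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2+\epsilon}},\qquad y=\gamma\,\hat x+\beta \]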

Programming details

forward:

sample_mean = np.mean(x, axis=0)
sample_var = np.var(x, axis=0)
x_hat = (x - sample_mean) / (np.sqrt(sample_var + eps))
out = gamma * x_hat + beta
cache = (gamma, x, sample_mean, sample_var, eps, x_hat) # also cache eps, since backward unpacks it
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

# at test time
scale = gamma / (np.sqrt(running_var + eps))
out = x * scale + (beta - running_mean * scale)
# same formula as above but with the running statistics, rearranged so the per-feature scale and shift are computed once

backward:

# the tricky part: if x changes, the batch mean and variance change too,
# so the derivative has to flow through x_hat, the mean, and the variance
gamma, x, sample_mean, sample_var, eps, x_hat = cache
N = x.shape[0]

dbeta = np.sum(dout, axis=0) # yes, a sum: accumulate the contribution of every sample
dgamma = np.sum(dout * x_hat, axis=0)

dx_hat = dout * gamma
dvar = np.sum(dx_hat * (x - sample_mean) * -0.5 * np.power(sample_var + eps, -1.5), axis=0)
dmean = np.sum(-dx_hat / np.sqrt(sample_var + eps), axis=0) + dvar * np.mean(-2.0 * (x - sample_mean), axis=0)
dx = dx_hat / np.sqrt(sample_var + eps) + dvar * 2.0 * (x - sample_mean) / N + dmean / N
# each x_i affects the loss directly through x_hat_i, and indirectly through the batch mean (1/N each) and the batch variance (2*(x_i - mean)/N each)

Formula for the variance:

\[s^2=\frac{(x_1-\mu)^2+(x_2-\mu)^2+\cdots+(x_N-\mu)^2}{N} \]

dropout

Basic idea

Vanilla dropout: at train time, set each neuron to 0 with probability p; at test time, multiply the whole layer's output by (1-p).

Inverted dropout: at train time, set each neuron to 0 with probability p, i.e. keep a fraction (1-p) of the values, and then divide everything by (1-p) so that the surviving neurons are scaled up and the expected activation is unchanged, as if nothing had happened. At test time, nothing needs to be done.

Layer pattern: affine - [batch norm] - relu - [dropout].

Programming details

# forward
mask = (np.random.rand(*x.shape) >= p) / (1-p)
out = x * mask
# backward
dx = dout * mask

To add dropout to the [fully connected - batch norm - relu - dropout] sandwich layer: in forward, apply dropout to the output as the last step; in backward, first push the upstream gradient through the dropout backward before the rest, as in the sketch below.
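A minimal sketch of such a sandwich layer, assuming the assignment's helper functions (affine_forward/backward, batchnorm_forward/backward, relu_forward/backward, dropout_forward/backward) with their usual (out, cache) signatures:

def affine_bn_relu_dropout_forward(x, w, b, gamma, beta, bn_param, dropout_param):
    a, fc_cache = affine_forward(x, w, b)
    bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    r, relu_cache = relu_forward(bn)
    out, do_cache = dropout_forward(r, dropout_param) # dropout applied last
    return out, (fc_cache, bn_cache, relu_cache, do_cache)

def affine_bn_relu_dropout_backward(dout, cache):
    fc_cache, bn_cache, relu_cache, do_cache = cache
    ddo = dropout_backward(dout, do_cache) # dropout backward first
    dr = relu_backward(ddo, relu_cache)
    dbn, dgamma, dbeta = batchnorm_backward(dr, bn_cache)
    dx, dw, db = affine_backward(dbn, fc_cache)
    return dx, dw, db, dgamma, dbeta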

convolutional networks

Basic idea

convolution

The input has shape (N, C, H, W): N is the number of samples, C is the number of channels (e.g. RGB), and H, W are the height and width.

The filters have shape (F, C, HH, WW): F is the number of filters, and HH, WW are the filter height and width.

The output has shape (N, F, H_out, W_out): each sample is convolved with all F filters, so the first dimension is N and the second is F. H_out and W_out are the spatial size after the convolution.

The convolution also has a bias parameter, a vector of length F, which shifts each output map by a constant.

There are two more hyperparameters: the stride and the padding (pad).

H_out and W_out are computed as:

H_out = 1 + (H + 2 * pad - HH) // stride
W_out = 1 + (W + 2 * pad - WW) // stride
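For example, a 32×32 input with a 7×7 filter, pad 3 and stride 1 gives H_out = 1 + (32 + 2*3 - 7) // 1 = 32, so the spatial size is preserved.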

A naive way to compute one output position of the convolution:

out[:, f, i, j] = np.sum(x_masked * w[f,:,:,:], axis=(1,2,3))

max pooling

The input has shape (N, C, H, W); the pooling parameters are HH, WW and the stride.

Each step looks at an HH*WW window, records the maximum value inside it, and then moves by the stride.

The output has shape (N, C, H_out, W_out), where H_out and W_out are computed as follows (like convolution, but without padding):

H_out = 1 + (H - HH) // stride
W_out = 1 + (W - WW) // stride
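For example, a 32×32 input with 2×2 pooling and stride 2 gives H_out = 1 + (32 - 2) // 2 = 16, halving the spatial size.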

To take the max over a window, use np.max(x_masked, axis=(2,3)).

spatial batch normalization

Let the input be a 4-D array (N, C, H, W). In a CNN, each feature map is treated as one feature (one neuron), so when applying spatial batchnorm the effective mini-batch size is N*H*W, and each feature map has only two learnable parameters: γ and β.

In other words, compute the mean and variance over all positions of a given feature map across all samples, and use them to normalize that feature map.

https://blog.csdn.net/hjimce/article/details/50866313

Programming details

convolution:

# forward
N, C, H, W = x.shape
F, _, HH, WW = w.shape
stride, pad = conv_param['stride'], conv_param['pad']
H_out = 1 + (H + 2 * pad - HH) // stride
W_out = 1 + (W + 2 * pad - WW) // stride
out = np.zeros((N, F, H_out, W_out))

x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant',constant_values=0)
"""
np.pad:填充数组的边缘,就是一个padding操作。
第一个参数是需要填充的数组。
第二个参数是填充大小,格式为((before_1, after_1), … (before_N, after_N)),其中(before_1, after_1)表示第1轴两边缘分别填充before_1个和after_1个数值。
最后一个参数表示填充的方式。
"""
for i in range(H_out):
    for j in range(W_out):
        x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        for k in range(F):
            out[:, k, i, j] = np.sum(x_pad_masked * w[k, :, :, :], axis=(1,2,3))
for k in range(F):
    out[:, k, :, :] += b[k]

# backward
x, w, b, conv_param = cache
N, C, H, W = x.shape
F, _, HH, WW = w.shape
stride, pad = conv_param['stride'], conv_param['pad']
N, F, H_out, W_out = dout.shape

x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
dx = np.zeros_like(x)
dx_pad = np.zeros_like(x_pad)
dw = np.zeros_like(w)
db = np.sum(dout, axis=(0,2,3))

for i in range(H_out):
    for j in range(W_out):
        x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        for k in range(F): # compute dw
            dw[k,:,:,:] += np.sum(x_pad_masked * (dout[:,k,i,j])[:, None, None, None], axis=0)
            # for each filter, the sum accumulates the contributions of the N samples
        for n in range(N): # compute dx_pad
            dx_pad[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += np.sum((w[:,:,:,:] * (dout[n, :, i, j])[:,None ,None, None]), axis=0)
            # for each sample, the sum accumulates the gradients coming from the F filters
dx = dx_pad[:,:,pad:-pad,pad:-pad]

max pooling:

# forward
N, C, H, W = x.shape
HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
H_out = (H - HH) // stride + 1
W_out = (W - WW) // stride + 1
out = np.zeros((N, C, H_out, W_out))
for i in range(H_out):
    for j in range(W_out):
        x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        out[:,:,i,j] = np.max(x_masked, axis=(2,3))

# backward
x, pool_param = cache
N, C, H, W = x.shape
HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
N, C, H_out, W_out = dout.shape
dx = np.zeros_like(x)

for i in range(H_out):
    for j in range(W_out):
        x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        max_x_masked = np.max(x_masked,axis=(2,3))
        temp_binary_mask = (x_masked == (max_x_masked)[:,:,None,None])
        # if several entries tie for the max, all of them receive the gradient
        dx[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += temp_binary_mask * (dout[:,:,i,j])[:,:,None,None]

spatial batch normalization:

# forward
N, C, H, W = x.shape
temp_output, cache = batchnorm_forward(x.transpose(0,3,2,1).reshape((N*H*W,C)), gamma, beta, bn_param)
out = temp_output.reshape(N,W,H,C).transpose(0,3,2,1)

# backward
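# (sketch, not in the original notes: mirror the forward reshape, then reuse the plain
#  batchnorm_backward from the fully-connected part on the flattened (N*H*W, C) array)
N, C, H, W = dout.shape
dx_temp, dgamma, dbeta = batchnorm_backward(dout.transpose(0,3,2,1).reshape((N*H*W, C)), cache)
dx = dx_temp.reshape(N, W, H, C).transpose(0,3,2,1)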

PyTorch quick start

First, import a bunch of things:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

import timeit

Then, since I don't have a GPU, define the data type as the CPU data type:

dtype = torch.FloatTensor # the CPU datatype
torch.cuda.is_available() # check whether a GPU is available; returns True if so
gpu_dtype = torch.cuda.FloatTensor # the GPU datatype

Next, define a Flatten module that unrolls an input of shape N*C*H*W into shape N*(C*H*W), i.e. the same thing as np.reshape(x, (x.shape[0], -1)).

class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size() # read in N, C, H, W
        return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

Next, define the model:

'''
architecture:
[conv - ReLU - BatchNorm - MaxPool] -
[conv - ReLU - BatchNorm - MaxPool] -
[affine - BatchNorm - ReLU] -
[affine - softmax]
'''
model_base = nn.Sequential(nn.Conv2d(in_channels=3,out_channels=16, kernel_size=5, stride=1),
                           nn.ReLU(inplace=True),
                           nn.BatchNorm2d(num_features=16),
                           nn.MaxPool2d(kernel_size=2,stride=2),
                           nn.Conv2d(in_channels=16,out_channels=32, kernel_size=3, stride=1),
                           nn.ReLU(inplace=True),
                           nn.BatchNorm2d(num_features=32),
                           nn.MaxPool2d(kernel_size=2,stride=2),
                           Flatten(),
                           nn.Linear(1152,200),  # 1152=32*6*6 input size
                           nn.BatchNorm1d(num_features=200),
                           nn.ReLU(inplace=True),
                           nn.Linear(200, 10), # affine layer
                          )

model = model_base.type(dtype) # define the base model first, then cast it to the concrete data type
loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

cs231n provides train and check-accuracy functions; copy them over:

def train(model, loss_fn, optimizer, num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, (x, y) in enumerate(loader_train):
            x_var = Variable(x.type(dtype))
            y_var = Variable(y.type(dtype).long())
            scores = model(x_var)
            loss = loss_fn(scores, y_var)
            
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for x, y in loader:
        with torch.no_grad():
            x_var = Variable(x.type(dtype))

        scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        num_correct += (preds == y).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

Then start training:

train(model, loss_fn, optimizer, num_epochs=10)
check_accuracy(model, loader_val) # validation
check_accuracy(best_model, loader_test) # test

TensorFlow quick start

First, import a bunch of things:

import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
import numpy as np
import math
import timeit
import matplotlib.pyplot as plt
%matplotlib inline

Next, declare X and y as placeholders.

X = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool) # batchnorm behaves differently at train and test time, so record which mode we are in

Declare the model:

def my_model(X,y,is_training):
    # Conv-Relu-BN
    conv1act = tf.layers.conv2d(inputs=X, filters=32, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
    bn1act = tf.layers.batch_normalization(inputs=conv1act, training=is_training)
    # Conv-Relu-BN
    conv2act = tf.layers.conv2d(inputs=bn1act, filters=64, padding='same', kernel_size=3, strides=1,
                                activation=tf.nn.relu)
    bn2act = tf.layers.batch_normalization(inputs=conv2act, training=is_training)
    # Maxpool
    maxpool1act = tf.layers.max_pooling2d(inputs=bn2act, pool_size=2, strides=2)
    # Flatten
    flatten1 = tf.reshape(maxpool1act,[-1,16384])
    # FC-Relu-BN
    fc1 = tf.layers.dense(inputs=flatten1, units=1024, activation=tf.nn.relu)
    bn3act = tf.layers.batch_normalization(inputs=fc1, training=is_training)
    # Output FC 
    y_out = tf.layers.dense(inputs=bn3act, units=10, activation=None)
    
    return y_out

Next, declare the loss and the optimizer.

# clear old variables
tf.reset_default_graph()

y_out = my_model(X,y,is_training)
mean_loss = tf.losses.softmax_cross_entropy(logits=y_out, onehot_labels=tf.one_hot(y,10))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

# batch normalization in tensorflow requires this extra dependency: the ops that update the running statistics live in UPDATE_OPS and must run before the train step
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_step = optimizer.minimize(mean_loss)

cs231n provides the training function; paste it in:

def run_model(session, predict, loss_val, Xd, yd,
              epochs=1, batch_size=64, print_every=100,
              training=None, plot_losses=False):
    # have tensorflow compute accuracy
    correct_prediction = tf.equal(tf.argmax(predict,1), y)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # shuffle indicies
    train_indicies = np.arange(Xd.shape[0])
    np.random.shuffle(train_indicies)

    training_now = training is not None
    
    # setting up variables we want to compute (and optimizing)
    # if we have a training function, add that to things we compute
    variables = [mean_loss,correct_prediction,accuracy]
    if training_now:
        variables[-1] = training
    
    # counter 
    iter_cnt = 0
    for e in range(epochs):
        # keep track of losses and accuracy
        correct = 0
        losses = []
        # make sure we iterate over the dataset once
        for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
            # generate indicies for the batch
            start_idx = (i*batch_size)%Xd.shape[0]
            idx = train_indicies[start_idx:start_idx+batch_size]
            
            # create a feed dictionary for this batch
            feed_dict = {X: Xd[idx,:],
                         y: yd[idx],
                         is_training: training_now }
            # get batch size
            actual_batch_size = yd[idx].shape[0]
            
            # have tensorflow compute loss and correct predictions
            # and (if given) perform a training step
            loss, corr, _ = session.run(variables,feed_dict=feed_dict)
            
            # aggregate performance stats
            losses.append(loss*actual_batch_size)
            correct += np.sum(corr)
            
            # print every now and then
            if training_now and (iter_cnt % print_every) == 0:
                print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
                      .format(iter_cnt,loss,np.sum(corr)/actual_batch_size))
            iter_cnt += 1
        total_correct = correct/Xd.shape[0]
        total_loss = np.sum(losses)/Xd.shape[0]
        print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
              .format(total_loss,total_correct,e+1))
        if plot_losses:
            plt.plot(losses)
            plt.grid(True)
            plt.title('Epoch {} Loss'.format(e+1))
            plt.xlabel('minibatch number')
            plt.ylabel('minibatch loss')
            plt.show()
    return total_loss,total_correct

Let's start training:

sess = tf.Session() # a Session encapsulates the state of the compute graph and the machinery for running it

sess.run(tf.global_variables_initializer())
print('Training')
run_model(sess,y_out,mean_loss,X_train,y_train,10,64,100,train_step,True)
print('Validation')
run_model(sess,y_out,mean_loss,X_val,y_val,1,64)

Finally, test:

print('Test')
run_model(sess,y_out,mean_loss,X_test,y_test,1,64)