cs231n assignment 2

20210913 - 20211005.

fully-connected nets

Basic idea

Wrap each kind of layer as a module, and the whole network can be built by modular programming.

Wrap a forward function: its input is the input of a computational-graph node, and it returns the node's output plus whatever information needs to be cached.

Wrap a backward function: its input is the upstream derivative, i.e. the derivative with respect to the node's output, and it returns the derivatives with respect to each of the node's inputs.

In backward, the chain rule plus shape matching lets you compute the partial derivatives almost mechanically with matrix multiplications.

Programming details

x_rsp = x.reshape(x.shape[0], -1) # N*d1*d2*... -> N*D, one sample per row
A = B.dot(C) # matrix multiplication
dx = dx.reshape(x.shape) # reshape back into the shape of x
out = x * (x >= 0) # ReLU: keep the values >= 0, concise numpy idiom
dx = (x > 0) * dout # ReLU backprop: pass the gradient only where x > 0
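As a concrete instance of this wrapper pattern, here is a minimal affine (fully-connected) layer sketch written in the same style as the assignment's layer interface (out/cache and dx/dw/db conventions as above):

import numpy as np

def affine_forward(x, w, b):
    x_rsp = x.reshape(x.shape[0], -1)  # (N, d1, ..., dk) -> (N, D)
    out = x_rsp.dot(w) + b             # w: (D, M), b: (M,)
    cache = (x, w, b)                  # keep what backward will need
    return out, cache

def affine_backward(dout, cache):
    x, w, b = cache
    x_rsp = x.reshape(x.shape[0], -1)
    dx = dout.dot(w.T).reshape(x.shape)  # back to the original shape of x
    dw = x_rsp.T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db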

On the shapes of the W parameters in the fully-connected layers:

layer_input_dim = input_dim
for i, hd in enumerate(hidden_dims):
    self.params['W%d'%(i+1)] = weight_scale * np.random.randn(layer_input_dim, hd)
    self.params['b%d'%(i+1)] = np.zeros(hd)
    if self.use_batchnorm:
        self.params['gamma%d'%(i+1)] = np.ones(hd)
        self.params['beta%d'%(i+1)] = np.zeros(hd)
    layer_input_dim = hd
self.params['W%d'%(self.num_layers)] = weight_scale * np.random.randn(layer_input_dim, num_classes)
self.params['b%d'%(self.num_layers)] = np.zeros(num_classes)

Stochastic gradient descent with momentum:

v = config['momentum'] * v - config['learning_rate'] * dw
# decay the velocity (e.g. by 0.9), then add a step in the negative gradient direction
next_w = w + v # update w with the velocity
config['velocity'] = v # store the updated velocity
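In formulas, with ρ = momentum and α = learning_rate, this is:

\[ v_{t+1}=\rho\,v_t-\alpha\,\nabla_w L,\qquad w_{t+1}=w_t+v_{t+1} \]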

RMSProp:

config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx**2)
# cache: weighted average (with weight decay_rate) of the old cache and the squared gradient dx**2
next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
# next_x: take a step of size learning_rate in the direction of -dx / (sqrt(cache) + epsilon); epsilon guards against division by zero

Adam:

config['t'] += 1
# t: incremented on every update; used below for the bias correction of m and v
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
# m: weighted average (with weight beta1) of the old m and the gradient dx
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx**2)
# v: weighted average (with weight beta2) of the old v and the squared gradient dx**2
mb = config['m'] / (1 - config['beta1']**config['t'])
# mb: bias-corrected m. m starts at 0 and is biased toward 0 in the first steps; dividing by (1 - beta1**t) inflates it to compensate. As t grows, beta1**t -> 0, the denominator -> 1, and the correction fades away.
vb = config['v'] / (1 - config['beta2']**config['t'])
# vb: bias-corrected v, same idea as above
next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
# next_x: take a step of size learning_rate in the direction of -mb / (sqrt(vb) + epsilon)

So what is Adam doing? Roughly:

  • In stochastic gradient descent we pick a descent direction and take a step of size learning_rate.
  • Which direction? The direction of mb / (sqrt(vb) + epsilon).
  • What is mb? It is m divided by (1 - β1^t).
    • The division by (1 - β1^t) is a bias correction: m is initialized to 0 and therefore biased toward 0 in the first steps; the denominator is small at first (so the correction inflates m) and approaches 1 as t grows (so the correction fades away).
    • And m itself is just momentum: a weighted average of the old m and the current gradient dx.
  • What is vb? It is v divided by (1 - β2^t).
    • Same bias correction as above, this time for v, which ends up inside the denominator of the step.
    • And v is the RMSProp trick: a weighted average of the old v and the squared gradient dx².
  • So Adam combines momentum and RMSProp: it steps along the momentum direction, rescales it by the square root of the running average of squared gradients, and bias-corrects both with the (1 - β^t) terms. The full update is summarized below.
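For reference, the standard Adam update in one place (matching the code above, with g_t = dx and α = learning_rate):

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\,g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \\
\hat m_t &= \frac{m_t}{1-\beta_1^t},\qquad \hat v_t=\frac{v_t}{1-\beta_2^t} \\
x_{t+1} &= x_t-\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon}
\end{aligned}
\]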

Review of the multiclass SVM loss and the softmax loss

multiclass svm loss & derivative

Also known as the hinge loss.
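For one sample with scores s and correct class y_i (margin 1, as in the code below):

\[ L_i=\sum_{j\neq y_i}\max(0,\;s_j-s_{y_i}+1) \]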

N = x.shape[0]
correct_class_scores = x[np.arange(N), y] # scores of the correct classes, a length-N vector
margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
margins[np.arange(N), y] = 0 # only the incorrect classes contribute to the loss
loss = np.sum(margins) / N # sum the loss over the N samples, then average
num_pos = np.sum(margins > 0, axis=1)
dx = np.zeros_like(x) # all-zero matrix with the shape of x
dx[margins > 0] = 1 # the loss grows when the score of an incorrect class with positive margin grows
dx[np.arange(N), y] -= num_pos
# each incorrect class with positive margin also pushes the loss up when the correct-class score decreases, hence -num_pos at the correct class
dx /= N # average the effect over the N samples
return loss, dx

softmax loss & derivative

Also known as the cross-entropy loss.
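For one sample with logits f and correct class y_i:

\[ L_i=-\log\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \]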

shifted_logits = x - np.max(x, axis=1, keepdims=True)
# in exp space this is exp(x) / exp(max(x))
Z = np.sum(np.exp(shifted_logits), axis=1, keepdims=True)
# summing over classes gives sum(exp(x)) / exp(max(x))
log_probs = shifted_logits - np.log(Z)
# equals log( exp(x) / sum(exp(x)) ), i.e. the log of the probabilities, without ever taking log(exp(...)) explicitly
# on the seemingly useless "subtract max(x)" (numerical stability): https://zhuanlan.zhihu.com/p/92714192
probs = np.exp(log_probs) # the probabilities
N = x.shape[0]
loss = -np.sum(log_probs[np.arange(N), y]) / N
# the loss is -log(probability of the correct class), averaged over the N samples
dx = probs.copy() # start from the computed probabilities
dx[np.arange(N), y] -= 1
# the gradient of the softmax loss w.r.t. the logits is probs - one_hot(y), so subtract 1 at each correct class
dx /= N # average over the N samples, since each sample contributes 1/N of the loss
return loss, dx

batch normalization

Basic idea

Normalize the data in a minibatch to zero mean and unit variance, then scale by γ and shift by β. This is implemented as a layer of its own.

It is usually placed right before the ReLU layer.
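In formulas, per feature dimension of the minibatch:

\[ \hat x=\frac{x-\mu_{\text{batch}}}{\sqrt{\sigma_{\text{batch}}^2+\epsilon}},\qquad y=\gamma\,\hat x+\beta \]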

Programming details

forward:

sample_mean = np.mean(x, axis=0)
sample_var = np.var(x, axis=0)
x_hat = (x - sample_mean) / (np.sqrt(sample_var + eps))
out = gamma * x_hat + beta
cache = (gamma, x, sample_mean, sample_var, eps, x_hat) # also cache eps, since backward unpacks it
running_mean = momentum * running_mean + (1 - momentum) * sample_mean
running_var = momentum * running_var + (1 - momentum) * sample_var

# at test time
scale = gamma / (np.sqrt(running_var + eps))
out = x * scale + (beta - running_mean * scale)
# same formula as above but with the running statistics, rearranged so the per-feature scale and shift are computed once

backward:

# the tricky part: if x changes, the batch mean and variance change too,
# so the derivative has to flow through x_hat, the mean, and the variance
gamma, x, sample_mean, sample_var, eps, x_hat = cache
N = x.shape[0]

dbeta = np.sum(dout, axis=0) # yes, a sum: accumulate the contribution of every sample
dgamma = np.sum(dout * x_hat, axis=0)

dx_hat = dout * gamma
dvar = np.sum(dx_hat * (x - sample_mean) * -0.5 * np.power(sample_var + eps, -1.5), axis=0)
dmean = np.sum(-dx_hat / np.sqrt(sample_var + eps), axis=0) + dvar * np.mean(-2.0 * (x - sample_mean), axis=0)
dx = dx_hat / np.sqrt(sample_var + eps) + dvar * 2.0 * (x - sample_mean) / N + dmean / N
# each x_i affects the loss directly through x_hat_i, and indirectly through the batch mean (1/N each) and the batch variance (2*(x_i - mean)/N each)

Formula for the variance:

\[s^2=\frac{(x_1-\mu)^2+(x_2-\mu)^2+\cdots+(x_N-\mu)^2}{N} \]

dropout

Basic idea

Vanilla dropout: at train time, set each neuron to 0 with probability p; at test time, multiply the whole layer's output by (1-p).

Inverted dropout: at train time, set each neuron to 0 with probability p, i.e. keep a fraction (1-p) of the values, and then divide everything by (1-p) so that the surviving neurons are scaled up and the expected activation is unchanged, as if nothing had happened. At test time, nothing needs to be done.

Layer pattern: affine - [batch norm] - relu - [dropout].

Programming details

# forward
mask = (np.random.rand(*x.shape) >= p) / (1-p)
out = x * mask
# backward
dx = dout * mask

To add dropout to the [fully connected - batch norm - relu - dropout] sandwich layer: in forward, apply dropout to the output as the last step; in backward, first push the upstream gradient through the dropout backward before the rest, as in the sketch below.
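A minimal sketch of such a sandwich layer, assuming the assignment's helper functions (affine_forward/backward, batchnorm_forward/backward, relu_forward/backward, dropout_forward/backward) with their usual (out, cache) signatures:

def affine_bn_relu_dropout_forward(x, w, b, gamma, beta, bn_param, dropout_param):
    a, fc_cache = affine_forward(x, w, b)
    bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    r, relu_cache = relu_forward(bn)
    out, do_cache = dropout_forward(r, dropout_param) # dropout applied last
    return out, (fc_cache, bn_cache, relu_cache, do_cache)

def affine_bn_relu_dropout_backward(dout, cache):
    fc_cache, bn_cache, relu_cache, do_cache = cache
    ddo = dropout_backward(dout, do_cache) # dropout backward first
    dr = relu_backward(ddo, relu_cache)
    dbn, dgamma, dbeta = batchnorm_backward(dr, bn_cache)
    dx, dw, db = affine_backward(dbn, fc_cache)
    return dx, dw, db, dgamma, dbeta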

convolutional networks

Basic idea

convolution

The input has shape (N, C, H, W): N is the number of samples, C is the number of channels (e.g. RGB), and H, W are the height and width.

The filters have shape (F, C, HH, WW): F is the number of filters, and HH, WW are the filter height and width.

The output has shape (N, F, H_out, W_out): each sample is convolved with all F filters, so the first dimension is N and the second is F. H_out and W_out are the spatial size after the convolution.

The convolution also has a bias parameter, a vector of length F, which shifts each output map by a constant.

There are two more hyperparameters: the stride and the padding (pad).

H_out and W_out are computed as:

H_out = 1 + (H + 2 * pad - HH) // stride
W_out = 1 + (W + 2 * pad - WW) // stride
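For example, a 32×32 input with a 7×7 filter, pad 3 and stride 1 gives H_out = 1 + (32 + 2*3 - 7) // 1 = 32, so the spatial size is preserved.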

A naive way to compute one output position of the convolution:

out[:, f, i, j] = np.sum(x_masked * w[f,:,:,:], axis=(1,2,3))

max pooling

The input has shape (N, C, H, W); the pooling parameters are HH, WW and the stride.

Each step looks at an HH*WW window, records the maximum value inside it, and then moves by the stride.

The output has shape (N, C, H_out, W_out), where H_out and W_out are computed as follows (like convolution, but without padding):

H_out = 1 + (H - HH) // stride
W_out = 1 + (W - WW) // stride
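For example, a 32×32 input with 2×2 pooling and stride 2 gives H_out = 1 + (32 - 2) // 2 = 16, halving the spatial size.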

To take the max over a window, use np.max(x_masked, axis=(2,3)).

spatial batch normalization

Let the input be a 4-D array (N, C, H, W). In a CNN, each feature map is treated as one feature (one neuron), so when applying spatial batchnorm the effective mini-batch size is N*H*W, and each feature map has only two learnable parameters: γ and β.

In other words, compute the mean and variance over all positions of a given feature map across all samples, and use them to normalize that feature map.

https://blog.csdn.net/hjimce/article/details/50866313

Programming details

convolution:

# forward
N, C, H, W = x.shape
F, _, HH, WW = w.shape
stride, pad = conv_param['stride'], conv_param['pad']
H_out = 1 + (H + 2 * pad - HH) // stride
W_out = 1 + (W + 2 * pad - WW) // stride
out = np.zeros((N, F, H_out, W_out))

x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant',constant_values=0)
"""
np.pad:填充数组的边缘,就是一个padding操作。
第一个参数是需要填充的数组。
第二个参数是填充大小,格式为((before_1, after_1), … (before_N, after_N)),其中(before_1, after_1)表示第1轴两边缘分别填充before_1个和after_1个数值。
最后一个参数表示填充的方式。
"""
for i in range(H_out):
    for j in range(W_out):
        x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        for k in range(F):
            out[:, k, i, j] = np.sum(x_pad_masked * w[k, :, :, :], axis=(1,2,3))
for k in range(F):
    out[:, k, :, :] += b[k]

# backward
x, w, b, conv_param = cache
N, C, H, W = x.shape
F, _, HH, WW = w.shape
stride, pad = conv_param['stride'], conv_param['pad']
N, F, H_out, W_out = dout.shape

x_pad = np.pad(x, ((0,0), (0,0), (pad,pad), (pad,pad)), mode='constant', constant_values=0)
dx = np.zeros_like(x)
dx_pad = np.zeros_like(x_pad)
dw = np.zeros_like(w)
db = np.sum(dout, axis=(0,2,3))

for i in range(H_out):
    for j in range(W_out):
        x_pad_masked = x_pad[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        for k in range(F): # compute dw
            dw[k,:,:,:] += np.sum(x_pad_masked * (dout[:,k,i,j])[:, None, None, None], axis=0)
            # for each filter, the sum accumulates the contributions of the N samples
        for n in range(N): # compute dx_pad
            dx_pad[n, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += np.sum((w[:,:,:,:] * (dout[n, :, i, j])[:,None ,None, None]), axis=0)
            # for each sample, the sum accumulates the gradients coming from the F filters
dx = dx_pad[:,:,pad:-pad,pad:-pad]

max pooling:

# forward
N, C, H, W = x.shape
HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
H_out = (H - HH) // stride + 1
W_out = (W - WW) // stride + 1
out = np.zeros((N, C, H_out, W_out))
for i in range(H_out):
    for j in range(W_out):
        x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        out[:,:,i,j] = np.max(x_masked, axis=(2,3))

# backward
x, pool_param = cache
N, C, H, W = x.shape
HH, WW, stride = pool_param['pool_height'], pool_param['pool_width'], pool_param['stride']
N, C, H_out, W_out = dout.shape
dx = np.zeros_like(x)

for i in range(H_out):
    for j in range(W_out):
        x_masked = x[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW]
        max_x_masked = np.max(x_masked,axis=(2,3))
        temp_binary_mask = (x_masked == (max_x_masked)[:,:,None,None])
        # if several entries tie for the max, all of them receive the gradient
        dx[:, :, i*stride:i*stride+HH, j*stride:j*stride+WW] += temp_binary_mask * (dout[:,:,i,j])[:,:,None,None]

spatial batch normalization:

# forward
N, C, H, W = x.shape
temp_output, cache = batchnorm_forward(x.transpose(0,3,2,1).reshape((N*H*W,C)), gamma, beta, bn_param)
out = temp_output.reshape(N,W,H,C).transpose(0,3,2,1)

# backward
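# (sketch, not in the original notes: mirror the forward reshape, then reuse the plain
#  batchnorm_backward from the fully-connected part on the flattened (N*H*W, C) array)
N, C, H, W = dout.shape
dx_temp, dgamma, dbeta = batchnorm_backward(dout.transpose(0,3,2,1).reshape((N*H*W, C)), cache)
dx = dx_temp.reshape(N, W, H, C).transpose(0,3,2,1)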

PyTorch quick start

First, import a bunch of things:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

import timeit

Then, since I don't have a GPU, define the data type as the CPU data type:

dtype = torch.FloatTensor # the CPU datatype
torch.cuda.is_available() # check whether a GPU is available; returns True if so
gpu_dtype = torch.cuda.FloatTensor # the GPU datatype

Next, define a Flatten module that unrolls an input of shape N*C*H*W into shape N*(C*H*W), i.e. the same thing as np.reshape(x, (x.shape[0], -1)).

class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size() # read in N, C, H, W
        return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image

Next, define the model:

'''
architecture:
[conv - ReLU - BatchNorm - MaxPool] -
[conv - ReLU - BatchNorm - MaxPool] -
[affine - BatchNorm - ReLU] -
[affine - softmax]
'''
model_base = nn.Sequential(nn.Conv2d(in_channels=3,out_channels=16, kernel_size=5, stride=1),
                           nn.ReLU(inplace=True),
                           nn.BatchNorm2d(num_features=16),
                           nn.MaxPool2d(kernel_size=2,stride=2),
                           nn.Conv2d(in_channels=16,out_channels=32, kernel_size=3, stride=1),
                           nn.ReLU(inplace=True),
                           nn.BatchNorm2d(num_features=32),
                           nn.MaxPool2d(kernel_size=2,stride=2),
                           Flatten(),
                           nn.Linear(1152,200),  # 1152=32*6*6 input size
                           nn.BatchNorm1d(num_features=200),
                           nn.ReLU(inplace=True),
                           nn.Linear(200, 10), # affine layer
                          )

model = model_base.type(dtype) # define the base model first, then cast it to the concrete data type
loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

cs231n provides train and check-accuracy functions; copy them over:

def train(model, loss_fn, optimizer, num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, (x, y) in enumerate(loader_train):
            x_var = Variable(x.type(dtype))
            y_var = Variable(y.type(dtype).long())
            scores = model(x_var)
            loss = loss_fn(scores, y_var)
            
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.item()))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for x, y in loader:
        with torch.no_grad():
            x_var = Variable(x.type(dtype))

        scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        num_correct += (preds == y).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))

Then start training:

train(model, loss_fn, optimizer, num_epochs=10)
check_accuracy(model, loader_val) # validation
check_accuracy(best_model, loader_test) # test

TensorFlow quick start

First, import a bunch of things:

import tensorflow.compat.v1 as tf
tf.compat.v1.disable_eager_execution()
import numpy as np
import math
import timeit
import matplotlib.pyplot as plt
%matplotlib inline

Next, declare X and y as placeholders.

X = tf.placeholder(tf.float32, [None, 32, 32, 3])
y = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool) # batchnorm behaves differently at train and test time, so record which mode we are in

Declare the model:

def my_model(X,y,is_training):
    # Conv-Relu-BN
    conv1act = tf.layers.conv2d(inputs=X, filters=32, padding='same', kernel_size=3, strides=1, activation=tf.nn.relu)
    bn1act = tf.layers.batch_normalization(inputs=conv1act, training=is_training)
    # Conv-Relu-BN
    conv2act = tf.layers.conv2d(inputs=bn1act, filters=64, padding='same', kernel_size=3, strides=1,
                                activation=tf.nn.relu)
    bn2act = tf.layers.batch_normalization(inputs=conv2act, training=is_training)
    # Maxpool
    maxpool1act = tf.layers.max_pooling2d(inputs=bn2act, pool_size=2, strides=2)
    # Flatten
    flatten1 = tf.reshape(maxpool1act,[-1,16384])
    # FC-Relu-BN
    fc1 = tf.layers.dense(inputs=flatten1, units=1024, activation=tf.nn.relu)
    bn3act = tf.layers.batch_normalization(inputs=fc1, training=is_training)
    # Output FC 
    y_out = tf.layers.dense(inputs=bn3act, units=10, activation=None)
    
    return y_out

Next, declare the loss and the optimizer.

# clear old variables
tf.reset_default_graph()

y_out = my_model(X,y,is_training)
mean_loss = tf.losses.softmax_cross_entropy(logits=y_out, onehot_labels=tf.one_hot(y,10))
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)

# batch normalization in tensorflow requires this extra dependency: the ops that update the running statistics live in UPDATE_OPS and must run before the train step
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
    train_step = optimizer.minimize(mean_loss)

cs231n provides the training function; paste it in:

def run_model(session, predict, loss_val, Xd, yd,
              epochs=1, batch_size=64, print_every=100,
              training=None, plot_losses=False):
    # have tensorflow compute accuracy
    correct_prediction = tf.equal(tf.argmax(predict,1), y)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    
    # shuffle indicies
    train_indicies = np.arange(Xd.shape[0])
    np.random.shuffle(train_indicies)

    training_now = training is not None
    
    # setting up variables we want to compute (and optimizing)
    # if we have a training function, add that to things we compute
    variables = [mean_loss,correct_prediction,accuracy]
    if training_now:
        variables[-1] = training
    
    # counter 
    iter_cnt = 0
    for e in range(epochs):
        # keep track of losses and accuracy
        correct = 0
        losses = []
        # make sure we iterate over the dataset once
        for i in range(int(math.ceil(Xd.shape[0]/batch_size))):
            # generate indicies for the batch
            start_idx = (i*batch_size)%Xd.shape[0]
            idx = train_indicies[start_idx:start_idx+batch_size]
            
            # create a feed dictionary for this batch
            feed_dict = {X: Xd[idx,:],
                         y: yd[idx],
                         is_training: training_now }
            # get batch size
            actual_batch_size = yd[idx].shape[0]
            
            # have tensorflow compute loss and correct predictions
            # and (if given) perform a training step
            loss, corr, _ = session.run(variables,feed_dict=feed_dict)
            
            # aggregate performance stats
            losses.append(loss*actual_batch_size)
            correct += np.sum(corr)
            
            # print every now and then
            if training_now and (iter_cnt % print_every) == 0:
                print("Iteration {0}: with minibatch training loss = {1:.3g} and accuracy of {2:.2g}"\
                      .format(iter_cnt,loss,np.sum(corr)/actual_batch_size))
            iter_cnt += 1
        total_correct = correct/Xd.shape[0]
        total_loss = np.sum(losses)/Xd.shape[0]
        print("Epoch {2}, Overall loss = {0:.3g} and accuracy of {1:.3g}"\
              .format(total_loss,total_correct,e+1))
        if plot_losses:
            plt.plot(losses)
            plt.grid(True)
            plt.title('Epoch {} Loss'.format(e+1))
            plt.xlabel('minibatch number')
            plt.ylabel('minibatch loss')
            plt.show()
    return total_loss,total_correct

Let's start training:

sess = tf.Session() # a Session encapsulates the state of the compute graph and the machinery for running it

sess.run(tf.global_variables_initializer())
print('Training')
run_model(sess,y_out,mean_loss,X_train,y_train,10,64,100,train_step,True)
print('Validation')
run_model(sess,y_out,mean_loss,X_val,y_val,1,64)

Finally, test:

print('Test')
run_model(sess,y_out,mean_loss,X_test,y_test,1,64)