Machine Learning, Hung-yi Lee
notebook, shen fang, 2895044375@qq.com
Machine Learning
Machine Learning, Spring 2022, National Taiwan University, Hung-yi Lee
1 Introduction of Deep Learning
1.1 Machine Learning Intro
machine learning --> looking for a function (one too complex to be written manually)
** different types of functions **
regression : the function outputs a scalar
classification : the function outputs the correct option among given classes
structured learning : the function creates something with structure (image, document)
** how to find the function **
model --> loss --> optimization
batch : divide the samples into batches; the loss computed on one batch gives the gradients for one parameter update
epoch : one pass through all batches
1.2 Deep Learning
** backpropagation **
for a fully-connected multi-layer perceptron, the \(i\)th layer behaves as
\(z_k^{(i)} = \sum_j w_{kj}^{(i)} a_j^{(i-1)} + b_k^{(i)}, \quad a_k^{(i)} = \sigma(z_k^{(i)})\)
where \(\sigma(z)\) is the activation function, \(i\) denotes the layer index, \(k\) denotes the feature index
considering the L2 loss function
\(L = \frac{1}{2}\sum_{k=1}^{n} \big(a_k^{(m)} - \hat{y}_k\big)^2\)
where \(n\) is the number of output features, \(m\) is the number of layers
the backpropagation behaves as
\(\frac{\partial L}{\partial z_k^{(i)}} = \sigma'(z_k^{(i)}) \sum_j w_{jk}^{(i+1)} \frac{\partial L}{\partial z_j^{(i+1)}}, \quad \frac{\partial L}{\partial w_{kj}^{(i)}} = a_j^{(i-1)} \frac{\partial L}{\partial z_k^{(i)}}\)
then gradient descent behaves as
\(w_{kj}^{(i)} \leftarrow w_{kj}^{(i)} - \eta \frac{\partial L}{\partial w_{kj}^{(i)}}\)
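To make the chain rule above concrete, here is a minimal numpy sketch of one forward / backward pass for a 2-layer perceptron with sigmoid activations and L2 loss; all shapes and values are illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input features
y_hat = rng.normal(size=(3,))      # target
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

# forward pass: z^(i) = W^(i) a^(i-1) + b^(i), a^(i) = sigma(z^(i))
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - y_hat) ** 2)

# backward pass: propagate dL/dz from the output layer back
dz2 = (a2 - y_hat) * a2 * (1 - a2)        # dL/dz^(2)
dW2, db2 = np.outer(dz2, a1), dz2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)        # dL/dz^(1)
dW1, db1 = np.outer(dz1, x), dz1

# gradient descent step
eta = 0.1
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1
```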
1.3 Regression
** linear model **
model function :
\(y = b + \sum_{i=1}^{n} w_i x_i\)
where \(n\) is the number of input features
loss function :
\(L(w,b) = \sum_{s=1}^{N} \Big(\hat{y}^s - \big(b + \sum_i w_i x_i^s\big)\Big)^2\)
where \(N\) is the number of training samples
optimization : gradient descent
\(w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}, \quad b \leftarrow b - \eta \frac{\partial L}{\partial b}\)
** model selection **
polynomial model for Pokemon CP --> a higher-order polynomial is a more complex model and gives lower training loss --> testing loss decreases up to order 3 but increases beyond order 3 --> overfitting from an overly complex model
1.4 Classification
** ideal model for binary classification **
model :
\(f(x) = \begin{cases}
\text{class 1}, & g(x)>0 \\
\text{class 2}, & \text{otherwise}
\end{cases}\)
loss : \(L(f) = \sum_i\delta(f(x_i) \ne \hat{y}_i)\)
** probabilistic generative model **
for a binary classification, given an \(x\), the probability that it belongs to \(C_1\) is
\(P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}\)
generative model : \(P(x) = P(x|C_1)P(C_1) + P(x|C_2)P(C_2)\)
assume that the dataset is sampled from a Gaussian distribution
how to find the Gaussian distribution --> maximum likelihood
\(L(\mu, \Sigma) = \prod_{i=1}^{N} f_{\mu,\Sigma}(x^i)\)
where \(N\) is the sample number, and maximizing the likelihood gives
\(\mu^* = \frac{1}{N}\sum_{i=1}^{N} x^i, \quad \Sigma^* = \frac{1}{N}\sum_{i=1}^{N} (x^i - \mu^*)(x^i - \mu^*)^\top\)
model : \(P(C_1|x)>0.5\), class1; otherwise, class2
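A small sketch of this generative classifier on 2-D toy data, assuming a shared covariance matrix (introduced formally in the next section); `fit_gaussian`, `posterior_c1` and the data are illustrative names and values, not the lecture's.

```python
import numpy as np

def fit_gaussian(X):
    # maximum-likelihood mean and covariance of one class
    mu = X.mean(axis=0)
    diff = X - mu
    return mu, diff.T @ diff / len(X)

def log_gauss(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    return -0.5 * (diff @ np.linalg.solve(sigma, diff)
                   + np.log(np.linalg.det(sigma)) + d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X1 = rng.normal(loc=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=-1.0, size=(80, 2))   # class 2 samples

mu1, s1 = fit_gaussian(X1)
mu2, s2 = fit_gaussian(X2)
sigma = (len(X1) * s1 + len(X2) * s2) / (len(X1) + len(X2))  # shared covariance
p1 = len(X1) / (len(X1) + len(X2))        # prior P(C1)

def posterior_c1(x):
    a = np.exp(log_gauss(x, mu1, sigma)) * p1
    b = np.exp(log_gauss(x, mu2, sigma)) * (1 - p1)
    return a / (a + b)

print(posterior_c1(np.array([1.0, 1.0])) > 0.5)   # True --> class 1
```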
1.5 Logistic Regression
revisiting the probabilistic generative model, we have
\(P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}\)
let
\(z = \ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}\)
we have the sigmoid function
\(P(C_1|x) = \sigma(z) = \frac{1}{1+e^{-z}}\)
considering Gaussian class-conditional distributions, \(\Sigma_1\) and \(\Sigma_2\) are usually assumed to be the same, and \(z\) becomes linear in \(x\)
\(z = w \cdot x + b\)
then we can get the logistic regression model :
\(f_{w,b}(x) = \sigma(w \cdot x + b)\)
loss : assume that \(f_{w,b}(x) = P_{w,b}(C_1|x)\), then the likelihood of the training data is
\(L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \big(1-f_{w,b}(x^3)\big) \cdots\)
optimization :
\(w^*, b^* = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} \big(-\ln L(w,b)\big)\)
let \(\hat{y}(x^s|x^s\in C_1) = 1\), \(\hat{y}(x^s|x^s\in C_2) = 0\); the loss function becomes
\(-\ln L(w,b) = \sum_s -\big[\hat{y}^s \ln f_{w,b}(x^s) + (1-\hat{y}^s)\ln\big(1-f_{w,b}(x^s)\big)\big]\)
where the summed term is the cross entropy of two Bernoulli distributions
gradient descent :
\(w_i \leftarrow w_i - \eta \sum_s -\big(\hat{y}^s - f_{w,b}(x^s)\big)\, x_i^s\)
Question : why cross entropy rather than L2 loss ?
for L2 loss, we have
\(\frac{\partial \big(f_{w,b}(x)-\hat{y}\big)^2}{\partial w_i} = 2\big(f_{w,b}(x)-\hat{y}\big)\, f_{w,b}(x)\big(1-f_{w,b}(x)\big)\, x_i\)
if \(\hat{y}^s=1\) and \(f(x^s)=1\), the gradient is zero (already at the target)
if \(\hat{y}^s=1\) and \(f(x^s)=0\), the gradient is also near zero because of the sigmoid factor \(f(1-f)\), so gradient descent is slow even though the prediction is far from the target
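A quick numeric check of this argument: with the prediction far from the target, the cross-entropy gradient stays large while the L2 gradient is crushed by the \(f(1-f)\) factor. Scalar case with a single weight; the values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_hat = 1.0, 1.0
w = -8.0                      # makes f(x) = sigmoid(wx) close to 0
f = sigmoid(w * x)

grad_ce = -(y_hat - f) * x                     # cross entropy: stays large
grad_l2 = 2 * (f - y_hat) * f * (1 - f) * x    # L2: killed by f(1-f)

print(f, grad_ce, grad_l2)   # f~0.0003, |grad_ce|~1.0, |grad_l2|~0.0007
```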
2 What to do if My Network Fails to Train
2.1 Machine Learning Strategy
considering optimization failure : a deeper network can behave worse than a shallower one even on training data, which is an optimization issue rather than overfitting
model constraints for overfitting : fewer parameters; fewer features; early stopping; regularization; dropout; n-fold cross validation; ...
mismatch : training data and testing data are not in the same distribution
2.2 Optimization Strategy 1 -- Critical Points
For a loss function \(L(\theta)\), we have the Taylor expansion at \(\theta=\theta'\)
\(L(\theta) \approx L(\theta') + (\theta-\theta')^\top g + \frac{1}{2}(\theta-\theta')^\top H (\theta-\theta')\)
where \(g = \nabla L(\theta')\) is the gradient, which equals zero at critical points
and \(H\) is the Hessian matrix, \(H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}\big|_{\theta=\theta'}\)
at a critical point, \(L(\theta) \approx L(\theta') + \frac{1}{2}(\theta-\theta')^\top H (\theta-\theta')\), and we have
| class | \((\theta-\theta')^\top H (\theta-\theta')\) for all \((\theta-\theta')\) | eigenvalues of \(H\) |
|---|---|---|
| local maxima | \(<0\) | all negative |
| local minima | \(>0\) | all positive |
| saddle point | both \(>0\) and \(<0\) | both positive and negative |
Comment : saddle points are encountered much more frequently than local minima in high-dimensional spaces
2.3 Optimization Strategy 2 -- Batch, Momentum and Learning Rate
** optimization with batch **
a bigger batch size takes less time per epoch (GPU parallelism), while a smaller batch size gives noisier gradients, which helps escape critical points and often yields better performance
** momentum **
movement --> movement of the last step (scaled by a momentum factor) minus the gradient at the present step : \(m^{(t+1)} = \lambda m^{(t)} - \eta g^{(t)}\), \(\theta^{(t+1)} = \theta^{(t)} + m^{(t+1)}\)
** learning rate **
parameter-dependent learning rate : \(\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sigma_i^{(t)}}\, g_i^{(t)}\)
(1) Adagrad --> \(\sigma_i^{(t)}\) is the root mean square of all past gradients : \(\sigma_i^{(t)} = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t} \big(g_i^{(k)}\big)^2}\)
(2) RMSProp --> \(\sigma_i^{(t)}\) is an exponential moving average that emphasizes recent gradients : \(\sigma_i^{(t)} = \sqrt{\alpha \big(\sigma_i^{(t-1)}\big)^2 + (1-\alpha)\big(g_i^{(t)}\big)^2}\) (Adam --> RMSProp + momentum; see the sketch below)
** learning rate scheduling **
(1) learning rate decay --> decay with time
(2) warm up --> increase and then decrease (at the beginning, the estimate of \(\sigma_i^{(t)}\) has large variance)
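A sketch of the Adam-style update combining the two ideas (RMSProp \(\sigma\) plus momentum \(m\)); the hyperparameter values are common defaults rather than lecture values, and the bias-correction step belongs to Adam, not plain RMSProp.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # momentum: moving average of g
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSProp: moving average of g^2
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 101):                         # minimize ||theta||^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                    # near [0, 0]
```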
2.4 Optimization Strategy 3 -- Batch Normalization
for samples in a batch \(\{\mathbf{x}^1, \mathbf{x}^2, ..., \mathbf{x}^N\}\), with \(\mathbf{x}^i = [x_1^i, x_2^i, ..., x_m^i]^\top\), normalize each feature across the batch as
\(\tilde{x}_j^i = \frac{x_j^i - \mu_j}{\sigma_j}\)
where
\(\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_j^i, \quad \sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(x_j^i - \mu_j\big)^2}\)
such that the mean of each normalized feature is 0 and the variance is 1
training : compute moving average of the mean and variance in every batch \(\mathcal{B}\)
testing : normalize testing sample with \(\bar{\mu}\) and \(\bar{\sigma}\)
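A minimal batch-normalization sketch with moving statistics for testing, omitting the learnable scale and shift; the momentum value and shapes are illustrative.

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.moving_mu = np.zeros(dim)
        self.moving_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, X, training):
        if training:                       # X: (batch, features)
            mu, var = X.mean(axis=0), X.var(axis=0)
            self.moving_mu = self.momentum * self.moving_mu + (1 - self.momentum) * mu
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:                              # testing: use moving statistics
            mu, var = self.moving_mu, self.moving_var
        return (X - mu) / np.sqrt(var + self.eps)

bn = BatchNorm(dim=3)
batch = np.random.default_rng(0).normal(loc=5.0, size=(8, 3))
out = bn(batch, training=True)
print(out.mean(axis=0).round(6), out.var(axis=0).round(6))  # ~0 and ~1
```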
2.5 Training Dataset
considering the loss \(L(h,\mathcal{D})\) of a model, where \(h \in \mathcal{H}\) is a hypothesis (the model's parameters) and \(\mathcal{D}\) is a dataset, the best hypothesis on all data is
\(h^{\text{all}} = \arg\min_h L(h, \mathcal{D}_{\text{all}})\)
while usually the training dataset \(\mathcal{D}_{\text{train}}\) is sampled from \(\mathcal{D}_{\text{all}}\) such that
\(h^{\text{train}} = \arg\min_h L(h, \mathcal{D}_{\text{train}})\)
we hope that \(L(h^\text{train},\mathcal{D}_{\text{all}})\) and \(L(h^\text{all},\mathcal{D}_{\text{all}})\) are close, which can be formulated as
\(L(h^{\text{train}}, \mathcal{D}_{\text{all}}) - L(h^{\text{all}}, \mathcal{D}_{\text{all}}) \leq \delta\)
a "good" training set \(\mathcal{D}_{\text{train}}\) will satisfy the following property
\(\forall h \in \mathcal{H}, \; \big|L(h, \mathcal{D}_{\text{train}}) - L(h, \mathcal{D}_{\text{all}})\big| \leq \delta/2\)
Proof :
the probability of sampling a "bad" training set is bounded by a union over all hypotheses
\(P(\mathcal{D}_{\text{train}} \text{ is bad}) \leq \sum_{h \in \mathcal{H}} P(\mathcal{D}_{\text{train}} \text{ is bad due to } h)\)
according to Hoeffding's inequality (with \(\epsilon = \delta/2\), for a loss bounded in \([0,1]\)), we have
\(P(\mathcal{D}_{\text{train}} \text{ is bad due to } h) \leq 2\exp(-2N\epsilon^2)\)
where \(N\) is the number of samples in \(\mathcal{D}_{\text{train}}\), and we have
\(P(\mathcal{D}_{\text{train}} \text{ is bad}) \leq |\mathcal{H}| \cdot 2\exp(-2N\epsilon^2)\)
to make the probability of sampling a "bad" training set small, we could select a larger \(N\) and a smaller \(|\mathcal{H}|\)
trade-off of model complexity : small \(|\mathcal{H}|\) --> small gap between ideal and reality but a bad reality; big \(|\mathcal{H}|\) --> good reality but a big gap between ideal and reality
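A quick numeric check of the bound: for an illustrative \(|\mathcal{H}|\) and \(\epsilon\), the bound only becomes meaningful once \(N\) is large enough relative to \(|\mathcal{H}|\).

```python
import math

def bad_prob_bound(H_size, N, eps):
    # P(bad) <= |H| * 2 * exp(-2 * N * eps^2)
    return H_size * 2 * math.exp(-2 * N * eps ** 2)

for N in (100, 1000, 10000):
    print(N, bad_prob_bound(H_size=10000, N=N, eps=0.1))
# the bound drops below 1 only once N is large enough for this |H|
```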
3 Image as Input
3.1 Convolutional Neural Network
image input --> height \(\times\) width \(\times\) channel
observation 1 : receptive field --> neuron / filter for detecting small patterns (kernel size, stride, padding)
observation 2 : the same patterns appear in different regions / each filter convolves over the whole input image --> parameter sharing (the same parameters at every neuron / filter position; see the sketch below)
receptive field + parameter sharing --> convolutional layer
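A minimal sketch of one convolutional filter, showing both ideas at once: each output value sees only its receptive field, and the same kernel parameters are reused at every position. Stride 1, no padding; shapes and the blur kernel are illustrative.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros(((H - k) // stride + 1, (W - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # receptive field: the k x k patch this output neuron sees
            field = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(field * kernel)   # same kernel everywhere
    return out

image = np.random.default_rng(0).normal(size=(6, 6))
kernel = np.ones((3, 3)) / 9.0                   # e.g. a blur filter
print(conv2d(image, kernel).shape)               # (4, 4)
```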
3.2 Spatial Transformer Layer
CNN is not invariant to scaling and rotation --> transform feature map before CNN
for any affine transformation of an image
\(\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} e \\ f \end{bmatrix}\)
which means that the transformed pixel \(a_{x'y'}^{(l)}\) is derived from the original pixel \(a_{xy}^{(l-1)}\)
the transformed pixel has integer indices, but the corresponding original indices \(x\) and \(y\) may be non-integer
if we simply round the non-integer indices, gradient descent cannot work, e.g.
when the parameters of the STL change slightly, the rounded indices stay the same
which means the mapping doesn't change with the parameters, so the gradient will be zero
interpolation : transformed pixels are (bilinear) interpolations of the neighborhood of the original location
\(a_{x'y'}^{(l)} = \sum_{m}\sum_{n} a_{mn}^{(l-1)} \max\big(0, 1-|x-m|\big)\max\big(0, 1-|y-n|\big)\)
where the transformed pixels change smoothly with the parameters, so gradients can flow
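A sketch of bilinear sampling at a non-integer source index: the output is a weighted sum of the four neighboring pixels, so it varies smoothly with \((x, y)\). Boundary handling is omitted for brevity.

```python
import numpy as np

def bilinear_sample(img, x, y):
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                  # fractional offsets
    return ((1 - wx) * (1 - wy) * img[x0, y0]
            + wx * (1 - wy) * img[x1, y0]
            + (1 - wx) * wy * img[x0, y1]
            + wx * wy * img[x1, y1])

img = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.25))       # 8.25, between img[1:3, 2:4]
```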
4 Sequence as Input
4.1 Recurrent Neural Network
save memory of sequence as hidden state (Elman Network)
** bidirectional RNN ** --> forward + backward
** long short-term memory (LSTM) **
inputs --> input gate --> memory cell (with forget gate) --> output gate --> outputs
the input \(z\) goes through the network as
\(c' = g(z)\, f(z_i) + c\, f(z_f), \quad a = h(c')\, f(z_o)\)
where the activation function \(f\) is usually a sigmoid function, and \(z_i\), \(z_f\) and \(z_o\) are the inputs of the 3 gates mentioned above
for the complete version of LSTM, the inputs are the concatenation of the current input, the hidden state, and the memory cell (peephole)
strength : can deal with gradient vanishing (not gradient explosion) --> memory and input are added, so past information persists unless the forget gate closes, rather than being multiplicatively overwritten at every step
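A minimal LSTM cell following the gate equations above, assuming random placeholder weights and the concatenation of \(x\) and \(h\) as input (the peephole connection to \(c\) is omitted for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
Wz, Wi, Wf, Wo = (rng.normal(size=(dim_h, dim_x + dim_h)) for _ in range(4))

def lstm_step(x, h, c):
    xh = np.concatenate([x, h])
    z = np.tanh(Wz @ xh)              # candidate input g(z)
    zi = sigmoid(Wi @ xh)             # input gate  f(z_i)
    zf = sigmoid(Wf @ xh)             # forget gate f(z_f)
    zo = sigmoid(Wo @ xh)             # output gate f(z_o)
    c_new = z * zi + c * zf           # additive memory update
    h_new = np.tanh(c_new) * zo
    return h_new, c_new

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):     # run over a length-5 sequence
    h, c = lstm_step(x, h, c)
print(h, c)
```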
4.2 Graph Neural Network
graph : molecule, subway map, social network, ... --> node + edge
GNN : classification (molecule classifier), generation (drug design), ...
4.3 Spatial-based GNN
forward --> layer \(t\) \(\xrightarrow{\text{spatial-based convolution}}\) layer \(t+1\)
aggregate --> update hidden states of one node with its neighbor nodes
readout --> collect all node features to generate graph features
** NN4G (Neural network for graph) **
embedding --> \(h_v^{(0)} = \mathbf{w}^{(0)} \cdot \mathbf{x}_v\)
aggregate --> \(h_v^{(l+1)} = \hat{w}^{(l)} \sum_{u \in \mathcal{N}(v)} h_u^{(l)} + \bar{\mathbf{w}}^{(l)} \cdot \mathbf{x}_v\)
readout --> \(y = \sum_{l} w_l\, \text{mean}_v\big(h_v^{(l)}\big)\)
** MoNET (Mixture model network) **
define the distance \(\mathbf{u}(i,j)\) between two nodes; we can reformulate the aggregate block as a weighted sum, \(h_i^{(l+1)} = \sum_{j \in \mathcal{N}(i)} w\big(\mathbf{u}(i,j)\big)\, h_j^{(l)}\)
** GAT (Graph attention network) **
compute attention of neighbor nodes for weighted sum of aggregate block
** GIN (Graph isomorphism network) **
proved that summing neighbor node features works better than mean or max pooling, such that the best aggregate block behaves as
\(h_v^{(k)} = \text{MLP}^{(k)}\Big(\big(1+\epsilon^{(k)}\big) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big)\)
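A sketch of one GIN-style aggregate step on a toy 3-node graph; the single-layer "MLP" and the \(\epsilon\) value are illustrative.

```python
import numpy as np

A = np.array([[0, 1, 1],          # adjacency matrix of a 3-node graph
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.eye(3)                      # initial node features h_v^(0)
eps = 0.1
W = np.random.default_rng(0).normal(size=(3, 3))

def relu(x):
    return np.maximum(x, 0)

# h_v <- MLP((1 + eps) * h_v + sum_{u in N(v)} h_u), vectorized:
H_next = relu(((1 + eps) * H + A @ H) @ W)
print(H_next)
```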
4.4 Spectral-based GNN
** signal processing **
synthesis --> \(A = \sum_k a_k \hat{v}_k\), analysis --> \(a_j = A \cdot \hat{v}_j\), where \(\{\hat{v}_i\}\) is assumed to be an orthonormal basis
considering the signal \(x(t)\) formulated in the time domain
\(x(t) = \int x(\tau)\, \delta(t-\tau)\, d\tau\)
where \(\delta(t-\tau)\) is the basis
for the signal \(x(t)\) in the frequency domain we have
\(x(t) = \frac{1}{2\pi}\int X(j\omega)\, e^{j\omega t}\, d\omega\)
where \(e^{j\omega t}\) is the basis
Fourier transform --> analysis in the frequency domain :
\(X(j\omega) = \int x(t)\, e^{-j\omega t}\, dt\)
** spectral graph theory **
for an undirected graph \(\mathcal{G}=(V,E)\) and \(N=|V|\), define
(1) adjacency matrix \(A \in \mathbb{R}^{N \times N}\) :
\(A_{i,j} = \begin{cases}
0, & e_{i,j} \notin E \\
w(v_i,v_j), & e_{i,j} \in E
\end{cases}
\) , which is symmetric
(2) degree matrix \(D \in \mathbb{R}^{N \times N}\) :
\(
D_{i,j} = \begin{cases}
\sum_k A_{i,k}, & i=j \\
0, & i \ne j
\end{cases}
\) , which is diagonal
(3) signal on graph \(f:V \rightarrow \mathbb{R}^N\)
(4) graph Laplacian \(L=D-A\) , which is positive semi-definite ( WHY ? --> \(f^\top Lf \geq 0\) )
spectral decomposition :
\(L = U \Lambda U^\top, \quad \Lambda = \text{diag}(\lambda_0, \lambda_1, ..., \lambda_{N-1}), \quad U = [\mathbf{u}_0, \mathbf{u}_1, ..., \mathbf{u}_{N-1}]\)
where \(\lambda_i\) is called the frequency and \(\mathbf{u}_i\) is the corresponding basis
operating \(L\) on a graph signal \(f\), we have
\(f^\top L f = \frac{1}{2}\sum_{v_i \in V}\sum_{v_j \in V} w(v_i,v_j)\big(f(v_i) - f(v_j)\big)^2\)
which denotes the "power" of signal variation between vertices
for the basis \(\mathbf{u}_i\) as a graph signal, we have
\(\mathbf{u}_i^\top L \mathbf{u}_i = \lambda_i\)
which shows that a larger frequency corresponds to larger signal variation
graph Fourier transform --> \(\hat{x} = U^\top x\), \(\hat{x}_i = \mathbf{u_i}^\top x\) (seems like vector projection)
inverse graph Fourier transform --> \(x = U \hat{x} = \sum_k \mathbf{u_k} \hat{x}_k\)
filtering --> \(\hat{y} = g_\theta (\Lambda) \hat{x}\)
the total network goes through
\(y = g_\theta(L)\, x = U g_\theta(\Lambda) U^\top x\)
where \(g_\theta(L)\) is the object to learn (the filter)
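A sketch of this spectral filtering pipeline on a toy cycle graph, assuming an illustrative low-pass filter \(g(\lambda) = e^{-\lambda}\).

```python
import numpy as np

A = np.array([[0, 1, 0, 1],        # a 4-node cycle graph
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                           # graph Laplacian

lam, U = np.linalg.eigh(L)          # L = U diag(lam) U^T, lam >= 0
x = np.array([1.0, 0.0, 0.0, 0.0])  # signal on the graph

x_hat = U.T @ x                     # graph Fourier transform
g = np.exp(-lam)                    # low-pass filter g(lam)
y = U @ (g * x_hat)                 # filter, then inverse transform
print(y)
```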
** ChebNet **
use a polynomial to parametrize \(g_\theta(L)\)
\(g_\theta(L) = \sum_{k=0}^{K} \theta_k L^k\)
where the number of parameters to learn is fixed (a degree-\(K\) polynomial) and the filter is \(K\)-localized on the graph
Problem --> \(O(N^2)\) complexity
Solution --> use Chebyshev polynomials as the polynomial kernel, which can be computed recursively
rescale the frequency matrix so the eigenvalues lie in the Chebyshev domain \([-1,1]\)
\(\tilde{\Lambda} = \frac{2\Lambda}{\lambda_{\max}} - I\)
then the object to learn becomes
\(g_\theta(\tilde{\Lambda}) = \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}), \quad T_0(x)=1, \; T_1(x)=x, \; T_k(x)=2xT_{k-1}(x)-T_{k-2}(x)\)
the output goes
\(y = \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x = \sum_{k=0}^{K} \theta_k \bar{x}_k, \quad \bar{x}_k = 2\tilde{L}\bar{x}_{k-1} - \bar{x}_{k-2}\)
where each \(\bar{x}_k\) is computed recursively from the previous two and the total complexity becomes \(O(K|E|)\)
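A sketch of the Chebyshev recursion: each \(\bar{x}_k\) comes from the previous two with one matrix-vector product, so \(T_k(\tilde{L})\) never needs to be formed explicitly. The toy cycle graph and \(\theta\) values are illustrative.

```python
import numpy as np

def cheb_filter(L_tilde, x, theta):
    xbar_prev, xbar = x, L_tilde @ x            # T_0(L~)x, T_1(L~)x
    y = theta[0] * xbar_prev + theta[1] * xbar
    for k in range(2, len(theta)):
        xbar_prev, xbar = xbar, 2 * L_tilde @ xbar - xbar_prev
        y += theta[k] * xbar                    # one mat-vec per order
    return y

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lmax = np.linalg.eigvalsh(L).max()              # = 4 for this cycle
L_tilde = 2 * L / lmax - np.eye(4)              # eigenvalues now in [-1, 1]

x = np.random.default_rng(0).normal(size=4)
print(cheb_filter(L_tilde, x, theta=[0.5, 0.3, 0.2]))
```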
** GCN (Graph convolutional network) **
normalized graph Laplacian \(L^{\text{norm}} = D^{-\frac{1}{2}}LD^{-\frac{1}{2}} = I_N - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\)
the output goes (first-order approximation, \(K=1\), \(\lambda_{\max} \approx 2\))
\(y = \theta\big(I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\big)\,x\)
since the eigenvalues of \(I_N+D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\) are in the interval \([0,2]\), repeated application may induce numerical instability or gradient explosion / vanishing, so the renormalization trick is introduced
\(I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}, \quad \tilde{A} = A + I_N\)
where \(\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}\)
the resulting graph convolutional layer is
\(H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big)\)
where \(W^{(l)}\) is the parameter matrix to optimize
which can be rewritten in the per-node perceptron form
\(h_v^{(l+1)} = \sigma\Big(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{\tilde{A}_{vu}}{\sqrt{\tilde{D}_{vv}\tilde{D}_{uu}}}\, W^{(l)\top} h_u^{(l)}\Big)\)
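A sketch of one GCN layer with the renormalization trick on a toy graph; the weights and the ReLU nonlinearity are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    A_tilde = A + np.eye(len(A))                   # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # renormalized adjacency
    return np.maximum(A_hat @ H @ W, 0)            # ReLU

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.eye(3)                                      # one-hot node features
W = np.random.default_rng(0).normal(size=(3, 2))   # layer parameters W^(l)
print(gcn_layer(A, H, W))                          # new node features
```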
4.5 Word Embedding
word encoding methods :
(1) one-hot encoding --> ignores relationships between words
(2) word class --> classify words into classes --> ignores relationships between classes
(3) word embedding --> encode each word into a high dimensional space --> distance reflects the relationship between words
word embedding --> an unsupervised process to learn the encoding of each word
5 Sequence to Sequence
5.1 Attention
sophisticated input : inputs are a set of vectors, e.g. text sequence (embedding methods : one-hot encoding, word embedding, ...), voice sequence, graph (social network, molecule)
outputs :
(1) each input has a label (POS tagging)
(2) the whole sequence has a label (sentiment analysis, speaker recognition, molecular properties)
(3) model decides the number of labels itself (translation, speech recognition)
** sequence labeling **
trivial network : fully-connected layers
problem : a label may be influenced by neighboring inputs --> sequence window --> whole sequence (too long for a fixed window) --> self-attention
pseudo network : inputs --> self-attention --> FC layers --> outputs
** relevance \(\alpha\) between inputs **
dot-product : \(\alpha = (W^q a^i) \cdot (W^k a^j)\)
additive : \(\alpha = \mathbf{w}^\top \tanh(W^q a^i + W^k a^j)\)
5.2 Transformer
** algorithms **
(1) query \(q^i = W^q a^i\), key \(k^j = W^k a^j\) --> attention score \(\alpha_{i,j} = q^i \cdot k^j\)
(2) softmax : \(\alpha_{i,j}' = \frac{\exp(\alpha_{i,j})}{\sum_{k}\exp(\alpha_{i,k})}\)
(3) value \(v^i = W^v a^i\) --> \(b^j = \sum_i\alpha_{j,i}'v^i\)
matrix representation :
(1) \(Q = W^q I\), \(K = W^k I\), \(V = W^v I\), \(I = \text{cat}(a_1, a_2, ..., a_n)\)
(2) \(A = K^\top Q\), \(A' = \text{softmax}(A)\) (softmax over each column)
(3) \(O = V A' = V\, \text{softmax}(K^\top Q)\)
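A numpy sketch of single-head self-attention in this matrix form; the \(1/\sqrt{d}\) scaling is the usual Transformer convention, added here even though the notes omit it, and all shapes are illustrative.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n = 4, 6                                        # feature dim, seq length
I = rng.normal(size=(d, n))                        # inputs as columns
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = Wq @ I, Wk @ I, Wv @ I
Attn = softmax(K.T @ Q / np.sqrt(d), axis=0)       # A' = softmax(K^T Q)
O = V @ Attn                                       # one output per column
print(O.shape)                                     # (4, 6)
```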
** multi-head self attention **
(1) for head 1, query \(q^{i,1} = W^{q,1} a^i\), key \(k^{j,1} = W^{k,1} a^j\)
(2) \(\alpha_{i,j,1}' = \text{softmax}(\alpha_{i,j,1}) = \text{softmax}(q^{i,1} \cdot k^{j,1})\)
(3) value \(v^{i,1} = W^{v,1} a^i\) --> \(b^{j,1} = \sum_i\alpha_{j,i,1}'v^{i,1}\)
(4) \(o^j = W^o \times \text{cat}(b^{j,1}, b^{j,2}, ..., b^{j,m})\)
** positional encoding **
each position has a unique positional vector \(e^i\), which is added to the input embedding before self-attention : \(a^i \leftarrow e^i + a^i\)
** self-attention v.s. CNN **
CNN : self-attention that can only attend within a receptive field --> CNN is a simplified self-attention
self-attention : CNN with a learnable receptive field --> self-attention is the complex version of CNN
** decoder **
autoregressive decoder --> outputs the sequence token by token --> uses an END token to stop --> usually performs better
non-autoregressive decoder --> outputs the whole sequence in one step --> uses an END token to cut off the sequence --> parallel
5.3 Self-attention Variants
** domain-knowledge based **
local / truncated attention --> confine attention to a receptive field around each token
stride attention --> confine attention to neighbors that are \(\Delta\) tokens apart, where \(\Delta\) is the stride step
global attention --> add special tokens to the original sequence; they attend to every token (to collect global information) and are attended by every token (to distribute global information) --> no attention between non-special tokens
clustering --> cluster queries and keys --> confine attention to pairs within the same cluster
different attention choices ? --> use all in different heads (I WANT ALL !!!)
** learning based **
sinkhorn sorting network --> learn a network that transforms the input into a pre-attention matrix, which is then converted into a binary attention
switch matrix (yes/no attention)
input sequence \(\mathbf{x} \in \mathbb{R}^N\) --(network)--> pre-attention matrix \(M^p \in \mathbb{R}^{N \times N}\) --(operation trick)--> \(M^s \in \{0,1\}^{N \times N}\)
where the operation trick is a differentiable transformation
to reduce the complexity of the trained network, the sequence is usually split into subsequences (blocks) that each share one pre-attention column, rather than producing one column per token
synthesizer --> attention matrix as network parameter to learn
** matrix multiplication acceleration **
Linformer --> notices that the attention matrix is low-rank
assume the query / key dimension is \(t\), the value dimension is \(t'\), and the number of tokens is \(N\)
\(N\) keys --> \(n\) representative keys --> query matrix \(Q_{t \times N}\), key matrix \(K_{t \times n}\) --> attention matrix \(A_{n \times N}\)
\(N\) values --> \(n\) representative values --> value matrix \(V_{t' \times n}\) --> output matrix \(O_{t' \times N}\)
Question : why not reduce the number of queries ? --> the number of queries equals the output length
linear transformer / performer
ignoring the softmax, the self-attention network goes
\(O = V(K^\top Q)\)
two multiplication orders : (1) \(V(K^\top Q)\), (2) \((VK^\top)Q\)
times of multiplication : (1) \((t+t')N^2\), (2) \(2tt'N\)
since usually \(N \gg t, t'\), the second order costs much less than the first
put the softmax back --> assume that \(\exp(q \cdot k) \approx \phi(q) \cdot \phi(k)\), then
\(b^j = \frac{\sum_i \exp(q^j \cdot k^i)\, v^i}{\sum_i \exp(q^j \cdot k^i)} \approx \frac{\big(\sum_i v^i\, \phi(k^i)^\top\big)\, \phi(q^j)}{\big(\sum_i \phi(k^i)\big)^\top \phi(q^j)}\)
where the bracketed sums over \(i\) are identical for every \(j\) and can be computed once
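A sketch of this linear-attention trick, assuming the simple \(\text{elu}(x)+1\) feature map used by some linear transformers; the shared sums over keys and values are computed once and reused for every query, so the cost is linear in \(N\).

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))     # elu(x) + 1 > 0

rng = np.random.default_rng(0)
d, dv, N = 4, 3, 8
Q, K = rng.normal(size=(d, N)), rng.normal(size=(d, N))
V = rng.normal(size=(dv, N))

S = phi(K) @ V.T          # sum_i phi(k^i) (v^i)^T, shape (d, dv) -- shared
z = phi(K).sum(axis=1)    # sum_i phi(k^i), shape (d,)            -- shared
# each output b^j reuses S and z
B = (S.T @ phi(Q)) / (z @ phi(Q))   # numerator (dv, N) / denominator (N,)
print(B.shape)                       # (3, 8)
```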
5.4 Non-autoregressive Sequence Generation
autoregressive model --> sequence generation token by token --> generation time is proportional to the sequence length
non-autoregressive model --> each token is generated without depending on the other output tokens --> multi-modality problem (the output mixes many valid output modalities)
** Vanilla NAT **
predict fertility (how many output words each input word expands to) as a latent variable & copy input words accordingly --> represents a sentence-level "plan" before decoding
sequence-level knowledge distillation
teacher : autoregressive model --> student : non-autoregressive model
construct a new corpus with the teacher --> the teacher's greedy-decoded output serves as the student's training data
noisy parallel decoding
sample several fertility sequences --> generate several candidate sequences --> score them with an autoregressive model to find the best sequence
