Machine Learning, Hung-yi Lee
notebook, shen fang, 2895044375@qq.com
Machine Learning
Machine Learning, Spring 2022, National Taiwan University, Hung-yi Lee
1 Introduction of Deep Learning
1.1 Machine Learning Intro
machine learning --> looking for a function (one too complex to be written manually)
** different types of functions **
regression : the function outputs a scalar
classification : the function outputs the correct option among given classes
structured learning : the function creates something with structure (image, document)
** how to find the function **
model --> loss --> optimization
batch : divide the samples into batches; the loss computed on one batch gives the gradients for one parameter update
epoch : one pass through all batches
1.2 Deep Learning
** backpropagation **
for a fully-connected multi-layer perceptron, the \(i\)th layer behaves as
\(z_k^{(i)} = \sum_j w_{kj}^{(i)} a_j^{(i-1)} + b_k^{(i)}, \quad a_k^{(i)} = \sigma(z_k^{(i)})\)
where \(\sigma(z)\) is the activation function, \(i\) denotes the layer index, \(k\) denotes the feature index
considering the L2 loss function
\(L = \frac{1}{2}\sum_{k=1}^{n} \big(a_k^{(m)} - \hat{y}_k\big)^2\)
where \(n\) is the number of output features, \(m\) is the number of layers
the backpropagation behaves as
\(\frac{\partial L}{\partial z_k^{(i)}} = \sigma'(z_k^{(i)}) \sum_j w_{jk}^{(i+1)} \frac{\partial L}{\partial z_j^{(i+1)}}, \quad \frac{\partial L}{\partial w_{kj}^{(i)}} = a_j^{(i-1)} \frac{\partial L}{\partial z_k^{(i)}}\)
then gradient descent behaves as
\(w_{kj}^{(i)} \leftarrow w_{kj}^{(i)} - \eta \frac{\partial L}{\partial w_{kj}^{(i)}}\)
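To make the chain rule above concrete, here is a minimal numpy sketch of one forward / backward pass for a 2-layer perceptron with sigmoid activations and L2 loss; all shapes and values are illustrative, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))          # input features
y_hat = rng.normal(size=(3,))      # target
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

# forward pass: z^(i) = W^(i) a^(i-1) + b^(i), a^(i) = sigma(z^(i))
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - y_hat) ** 2)

# backward pass: propagate dL/dz from the output layer back
dz2 = (a2 - y_hat) * a2 * (1 - a2)        # dL/dz^(2)
dW2, db2 = np.outer(dz2, a1), dz2
dz1 = (W2.T @ dz2) * a1 * (1 - a1)        # dL/dz^(1)
dW1, db1 = np.outer(dz1, x), dz1

# gradient descent step
eta = 0.1
W2 -= eta * dW2; b2 -= eta * db2
W1 -= eta * dW1; b1 -= eta * db1
```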
1.3 Regression
** linear model **
model function :
\(y = b + \sum_{i=1}^{n} w_i x_i\)
where \(n\) is the number of input features
loss function :
\(L(w,b) = \sum_{s=1}^{N} \Big(\hat{y}^s - \big(b + \sum_i w_i x_i^s\big)\Big)^2\)
where \(N\) is the number of training samples
optimization : gradient descent
\(w_i \leftarrow w_i - \eta \frac{\partial L}{\partial w_i}, \quad b \leftarrow b - \eta \frac{\partial L}{\partial b}\)
** model selection **
polynomial model for Pokemon CP --> a higher-order polynomial is a more complex model and gives lower training loss --> testing loss decreases up to order 3 but increases beyond order 3 --> overfitting from an overly complex model
1.4 Classification
** ideal model for binary classification **
model :
\(f(x) = \begin{cases}
\text{class 1}, & g(x)>0 \\
\text{class 2}, & \text{otherwise}
\end{cases}\)
loss : \(L(f) = \sum_i\delta(f(x_i) \ne \hat{y}_i)\)
** probabilistic generative model **
for a binary classification, given an \(x\), the probability that it belongs to \(C_1\) is
\(P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}\)
generative model : \(P(x) = P(x|C_1)P(C_1) + P(x|C_2)P(C_2)\)
assume that the dataset is sampled from a Gaussian distribution
how to find the Gaussian distribution --> maximum likelihood
\(L(\mu, \Sigma) = \prod_{i=1}^{N} f_{\mu,\Sigma}(x^i)\)
where \(N\) is the sample number, and maximizing the likelihood gives
\(\mu^* = \frac{1}{N}\sum_{i=1}^{N} x^i, \quad \Sigma^* = \frac{1}{N}\sum_{i=1}^{N} (x^i - \mu^*)(x^i - \mu^*)^\top\)
model : \(P(C_1|x)>0.5\), class1; otherwise, class2
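A small sketch of this generative classifier on 2-D toy data, assuming a shared covariance matrix (introduced formally in the next section); `fit_gaussian`, `posterior_c1` and the data are illustrative names and values, not the lecture's.

```python
import numpy as np

def fit_gaussian(X):
    # maximum-likelihood mean and covariance of one class
    mu = X.mean(axis=0)
    diff = X - mu
    return mu, diff.T @ diff / len(X)

def log_gauss(x, mu, sigma):
    d = len(mu)
    diff = x - mu
    return -0.5 * (diff @ np.linalg.solve(sigma, diff)
                   + np.log(np.linalg.det(sigma)) + d * np.log(2 * np.pi))

rng = np.random.default_rng(0)
X1 = rng.normal(loc=1.0, size=(100, 2))   # class 1 samples
X2 = rng.normal(loc=-1.0, size=(80, 2))   # class 2 samples

mu1, s1 = fit_gaussian(X1)
mu2, s2 = fit_gaussian(X2)
sigma = (len(X1) * s1 + len(X2) * s2) / (len(X1) + len(X2))  # shared covariance
p1 = len(X1) / (len(X1) + len(X2))        # prior P(C1)

def posterior_c1(x):
    a = np.exp(log_gauss(x, mu1, sigma)) * p1
    b = np.exp(log_gauss(x, mu2, sigma)) * (1 - p1)
    return a / (a + b)

print(posterior_c1(np.array([1.0, 1.0])) > 0.5)   # True --> class 1
```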
1.5 Logistic Regression
revisiting the probabilistic generative model, we have
\(P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)}\)
let
\(z = \ln\frac{P(x|C_1)P(C_1)}{P(x|C_2)P(C_2)}\)
we have the sigmoid function
\(P(C_1|x) = \sigma(z) = \frac{1}{1+e^{-z}}\)
considering Gaussian class-conditional distributions, \(\Sigma_1\) and \(\Sigma_2\) are usually assumed to be the same, and \(z\) becomes linear in \(x\)
\(z = w \cdot x + b\)
then we can get the logistic regression model :
\(f_{w,b}(x) = \sigma(w \cdot x + b)\)
loss : assume that \(f_{w,b}(x) = P_{w,b}(C_1|x)\), then the likelihood of the training data is
\(L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \big(1-f_{w,b}(x^3)\big) \cdots\)
optimization :
\(w^*, b^* = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} \big(-\ln L(w,b)\big)\)
let \(\hat{y}(x^s|x^s\in C_1) = 1\), \(\hat{y}(x^s|x^s\in C_2) = 0\); the loss function becomes
\(-\ln L(w,b) = \sum_s -\big[\hat{y}^s \ln f_{w,b}(x^s) + (1-\hat{y}^s)\ln\big(1-f_{w,b}(x^s)\big)\big]\)
where the summed term is the cross entropy of two Bernoulli distributions
gradient descent :
\(w_i \leftarrow w_i - \eta \sum_s -\big(\hat{y}^s - f_{w,b}(x^s)\big)\, x_i^s\)
Question : why cross entropy rather than L2 loss ?
for L2 loss, we have
\(\frac{\partial \big(f_{w,b}(x)-\hat{y}\big)^2}{\partial w_i} = 2\big(f_{w,b}(x)-\hat{y}\big)\, f_{w,b}(x)\big(1-f_{w,b}(x)\big)\, x_i\)
if \(\hat{y}^s=1\) and \(f(x^s)=1\), the gradient is zero (already at the target)
if \(\hat{y}^s=1\) and \(f(x^s)=0\), the gradient is also near zero because of the sigmoid factor \(f(1-f)\), so gradient descent is slow even though the prediction is far from the target
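A quick numeric check of this argument: with the prediction far from the target, the cross-entropy gradient stays large while the L2 gradient is crushed by the \(f(1-f)\) factor. Scalar case with a single weight; the values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_hat = 1.0, 1.0
w = -8.0                      # makes f(x) = sigmoid(wx) close to 0
f = sigmoid(w * x)

grad_ce = -(y_hat - f) * x                     # cross entropy: stays large
grad_l2 = 2 * (f - y_hat) * f * (1 - f) * x    # L2: killed by f(1-f)

print(f, grad_ce, grad_l2)   # f~0.0003, |grad_ce|~1.0, |grad_l2|~0.0007
```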
2 What to do if My Network Fails to Train
2.1 Machine Learning Strategy
considering optimization failure : a deeper network can behave worse than a shallower one even on training data, which is an optimization issue rather than overfitting
model constraints for overfitting : fewer parameters; fewer features; early stopping; regularization; dropout; n-fold cross validation; ...
mismatch : training data and testing data are not in the same distribution
2.2 Optimization Strategy 1 -- Critical Points
For a loss function \(L(\theta)\), we have the Taylor expansion at \(\theta=\theta'\)
\(L(\theta) \approx L(\theta') + (\theta-\theta')^\top g + \frac{1}{2}(\theta-\theta')^\top H (\theta-\theta')\)
where \(g = \nabla L(\theta')\) is the gradient, which equals zero at critical points
and \(H\) is the Hessian matrix, \(H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}\big|_{\theta=\theta'}\)
at a critical point, \(L(\theta) \approx L(\theta') + \frac{1}{2}(\theta-\theta')^\top H (\theta-\theta')\), and we have
| class | \((\theta-\theta')^\top H (\theta-\theta')\) for all \((\theta-\theta')\) | eigenvalues of \(H\) |
|---|---|---|
| local maxima | \(<0\) | all negative |
| local minima | \(>0\) | all positive |
| saddle point | both \(>0\) and \(<0\) | both positive and negative |
Comment : saddle points are encountered much more frequently than local minima in high-dimensional spaces
2.3 Optimization Strategy 2 -- Batch, Momentum and Learning Rate
** optimization with batch **
a bigger batch size takes less time per epoch (GPU parallelism), while a smaller batch size gives noisier gradients, which helps escape critical points and often yields better performance
** momentum **
movement --> movement of the last step (scaled by a momentum factor) minus the gradient at the present step : \(m^{(t+1)} = \lambda m^{(t)} - \eta g^{(t)}\), \(\theta^{(t+1)} = \theta^{(t)} + m^{(t+1)}\)
** learning rate **
parameter-dependent learning rate : \(\theta_i^{(t+1)} = \theta_i^{(t)} - \frac{\eta}{\sigma_i^{(t)}}\, g_i^{(t)}\)
(1) Adagrad --> \(\sigma_i^{(t)}\) is the root mean square of all past gradients : \(\sigma_i^{(t)} = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t} \big(g_i^{(k)}\big)^2}\)
(2) RMSProp --> \(\sigma_i^{(t)}\) is an exponential moving average that emphasizes recent gradients : \(\sigma_i^{(t)} = \sqrt{\alpha \big(\sigma_i^{(t-1)}\big)^2 + (1-\alpha)\big(g_i^{(t)}\big)^2}\) (Adam --> RMSProp + momentum; see the sketch below)
** learning rate scheduling **
(1) learning rate decay --> decay with time
(2) warm up --> increase and then decrease (at the beginning, the estimate of \(\sigma_i^{(t)}\) has large variance)
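A sketch of the Adam-style update combining the two ideas (RMSProp \(\sigma\) plus momentum \(m\)); the hyperparameter values are common defaults rather than lecture values, and the bias-correction step belongs to Adam, not plain RMSProp.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # momentum: moving average of g
    v = beta2 * v + (1 - beta2) * grad ** 2     # RMSProp: moving average of g^2
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 101):                         # minimize ||theta||^2
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                    # near [0, 0]
```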
2.4 Optimization Strategy 3 -- Batch Normalization
for samples in a batch \(\{\mathbf{x}^1, \mathbf{x}^2, ..., \mathbf{x}^N\}\), with \(\mathbf{x}^i = [x_1^i, x_2^i, ..., x_m^i]^\top\), normalize each feature across the batch as
\(\tilde{x}_j^i = \frac{x_j^i - \mu_j}{\sigma_j}\)
where
\(\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_j^i, \quad \sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \big(x_j^i - \mu_j\big)^2}\)
such that the mean of each normalized feature is 0 and the variance is 1
training : compute moving average of the mean and variance in every batch \(\mathcal{B}\)
testing : normalize testing sample with \(\bar{\mu}\) and \(\bar{\sigma}\)
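A minimal batch-normalization sketch with moving statistics for testing, omitting the learnable scale and shift; the momentum value and shapes are illustrative.

```python
import numpy as np

class BatchNorm:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.moving_mu = np.zeros(dim)
        self.moving_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, X, training):
        if training:                       # X: (batch, features)
            mu, var = X.mean(axis=0), X.var(axis=0)
            self.moving_mu = self.momentum * self.moving_mu + (1 - self.momentum) * mu
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:                              # testing: use moving statistics
            mu, var = self.moving_mu, self.moving_var
        return (X - mu) / np.sqrt(var + self.eps)

bn = BatchNorm(dim=3)
batch = np.random.default_rng(0).normal(loc=5.0, size=(8, 3))
out = bn(batch, training=True)
print(out.mean(axis=0).round(6), out.var(axis=0).round(6))  # ~0 and ~1
```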
2.5 Training Dataset
considering the loss \(L(h,\mathcal{D})\) of a model, where \(h \in \mathcal{H}\) is a hypothesis (the model's parameters) and \(\mathcal{D}\) is a dataset, the best hypothesis on all data is
\(h^{\text{all}} = \arg\min_h L(h, \mathcal{D}_{\text{all}})\)
while usually the training dataset \(\mathcal{D}_{\text{train}}\) is sampled from \(\mathcal{D}_{\text{all}}\) such that
\(h^{\text{train}} = \arg\min_h L(h, \mathcal{D}_{\text{train}})\)
we hope that \(L(h^\text{train},\mathcal{D}_{\text{all}})\) and \(L(h^\text{all},\mathcal{D}_{\text{all}})\) are close, which can be formulated as
\(L(h^{\text{train}}, \mathcal{D}_{\text{all}}) - L(h^{\text{all}}, \mathcal{D}_{\text{all}}) \leq \delta\)
a "good" training set \(\mathcal{D}_{\text{train}}\) will satisfy the following property
\(\forall h \in \mathcal{H}, \; \big|L(h, \mathcal{D}_{\text{train}}) - L(h, \mathcal{D}_{\text{all}})\big| \leq \delta/2\)
Proof :
the probability of sampling a "bad" training set is bounded by a union over all hypotheses
\(P(\mathcal{D}_{\text{train}} \text{ is bad}) \leq \sum_{h \in \mathcal{H}} P(\mathcal{D}_{\text{train}} \text{ is bad due to } h)\)
according to Hoeffding's inequality (with \(\epsilon = \delta/2\), for a loss bounded in \([0,1]\)), we have
\(P(\mathcal{D}_{\text{train}} \text{ is bad due to } h) \leq 2\exp(-2N\epsilon^2)\)
where \(N\) is the number of samples in \(\mathcal{D}_{\text{train}}\), and we have
\(P(\mathcal{D}_{\text{train}} \text{ is bad}) \leq |\mathcal{H}| \cdot 2\exp(-2N\epsilon^2)\)
to make the probability of sampling a "bad" training set small, we could select a larger \(N\) and a smaller \(|\mathcal{H}|\)
trade-off of model complexity : small \(|\mathcal{H}|\) --> small gap between ideal and reality but a bad reality; big \(|\mathcal{H}|\) --> good reality but a big gap between ideal and reality
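A quick numeric check of the bound: for an illustrative \(|\mathcal{H}|\) and \(\epsilon\), the bound only becomes meaningful once \(N\) is large enough relative to \(|\mathcal{H}|\).

```python
import math

def bad_prob_bound(H_size, N, eps):
    # P(bad) <= |H| * 2 * exp(-2 * N * eps^2)
    return H_size * 2 * math.exp(-2 * N * eps ** 2)

for N in (100, 1000, 10000):
    print(N, bad_prob_bound(H_size=10000, N=N, eps=0.1))
# the bound drops below 1 only once N is large enough for this |H|
```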
3 Image as Input
3.1 Convolutional Neural Network
image input --> height \(\times\) width \(\times\) channel
observation 1 : receptive field --> neuron / filter for detecting small patterns (kernel size, stride, padding)
observation 2 : the same patterns appear in different regions / each filter convolves over the whole input image --> parameter sharing (the same parameters at every neuron / filter position; see the sketch below)
receptive field + parameter sharing --> convolutional layer
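A minimal sketch of one convolutional filter, showing both ideas at once: each output value sees only its receptive field, and the same kernel parameters are reused at every position. Stride 1, no padding; shapes and the blur kernel are illustrative.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros(((H - k) // stride + 1, (W - k) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # receptive field: the k x k patch this output neuron sees
            field = image[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = np.sum(field * kernel)   # same kernel everywhere
    return out

image = np.random.default_rng(0).normal(size=(6, 6))
kernel = np.ones((3, 3)) / 9.0                   # e.g. a blur filter
print(conv2d(image, kernel).shape)               # (4, 4)
```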
3.2 Spatial Transformer Layer
CNN is not invariant to scaling and rotation --> transform feature map before CNN
for any affine transformation of an image
\(\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} e \\ f \end{bmatrix}\)
which means that the transformed pixel \(a_{x'y'}^{(l)}\) is derived from the original pixel \(a_{xy}^{(l-1)}\)
the transformed pixel has integer indices, but the corresponding original indices \(x\) and \(y\) may be non-integer
if we simply round the non-integer indices, gradient descent cannot work, e.g.
when the parameters of the STL change slightly, the rounded indices stay the same
which means the mapping doesn't change with the parameters, so the gradient will be zero
interpolation : transformed pixels are (bilinear) interpolations of the neighborhood of the original location
\(a_{x'y'}^{(l)} = \sum_{m}\sum_{n} a_{mn}^{(l-1)} \max\big(0, 1-|x-m|\big)\max\big(0, 1-|y-n|\big)\)
where the transformed pixels change smoothly with the parameters, so gradients can flow
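A sketch of bilinear sampling at a non-integer source index: the output is a weighted sum of the four neighboring pixels, so it varies smoothly with \((x, y)\). Boundary handling is omitted for brevity.

```python
import numpy as np

def bilinear_sample(img, x, y):
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                  # fractional offsets
    return ((1 - wx) * (1 - wy) * img[x0, y0]
            + wx * (1 - wy) * img[x1, y0]
            + (1 - wx) * wy * img[x0, y1]
            + wx * wy * img[x1, y1])

img = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.25))       # 8.25, between img[1:3, 2:4]
```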
4 Sequence as Input
4.1 Recurrent Neural Network
save memory of sequence as hidden state (Elman Network)
** bidirectional RNN ** --> forward + backward
** long short-term memory (LSTM) **
inputs --> input gate --> memory cell (with forget gate) --> output gate --> outputs
the input \(z\) goes through the network as
\(c' = g(z)\, f(z_i) + c\, f(z_f), \quad a = h(c')\, f(z_o)\)
where the activation function \(f\) is usually a sigmoid function, and \(z_i\), \(z_f\) and \(z_o\) are the inputs of the 3 gates mentioned above
for the complete version of LSTM, the inputs are the concatenation of the current input, the hidden state, and the memory cell (peephole)
strength : can deal with gradient vanishing (not gradient explosion) --> memory and input are added, so past information persists unless the forget gate closes, rather than being multiplicatively overwritten at every step
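A minimal LSTM cell following the gate equations above, assuming random placeholder weights and the concatenation of \(x\) and \(h\) as input (the peephole connection to \(c\) is omitted for brevity).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
Wz, Wi, Wf, Wo = (rng.normal(size=(dim_h, dim_x + dim_h)) for _ in range(4))

def lstm_step(x, h, c):
    xh = np.concatenate([x, h])
    z = np.tanh(Wz @ xh)              # candidate input g(z)
    zi = sigmoid(Wi @ xh)             # input gate  f(z_i)
    zf = sigmoid(Wf @ xh)             # forget gate f(z_f)
    zo = sigmoid(Wo @ xh)             # output gate f(z_o)
    c_new = z * zi + c * zf           # additive memory update
    h_new = np.tanh(c_new) * zo
    return h_new, c_new

h, c = np.zeros(dim_h), np.zeros(dim_h)
for x in rng.normal(size=(5, dim_x)):     # run over a length-5 sequence
    h, c = lstm_step(x, h, c)
print(h, c)
```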
4.2 Graph Neural Network
graph : molecule, subway map, social network, ... --> node + edge
GNN : classification (molecule classifier), generation (drug design), ...
4.3 Spatial-based GNN
forward --> layer \(t\) \(\xrightarrow{\text{spatial-based convolution}}\) layer \(t+1\)
aggregate --> update hidden states of one node with its neighbor nodes
readout --> collect all node features to generate graph features
** NN4G (Neural network for graph) **
embedding --> \(h_v^{(0)} = \mathbf{w}^{(0)} \cdot \mathbf{x}_v\)
aggregate --> \(h_v^{(l+1)} = \hat{w}^{(l)} \sum_{u \in \mathcal{N}(v)} h_u^{(l)} + \bar{\mathbf{w}}^{(l)} \cdot \mathbf{x}_v\)
readout --> \(y = \sum_{l} w_l\, \text{mean}_v\big(h_v^{(l)}\big)\)
** MoNET (Mixture model network) **
define the distance \(\mathbf{u}(i,j)\) between two nodes; we can reformulate the aggregate block as a weighted sum, \(h_i^{(l+1)} = \sum_{j \in \mathcal{N}(i)} w\big(\mathbf{u}(i,j)\big)\, h_j^{(l)}\)
** GAT (Graph attention network) **
compute attention of neighbor nodes for weighted sum of aggregate block
** GIN (Graph isomorphism network) **
proved that summing neighbor node features works better than mean or max pooling, such that the best aggregate block behaves as
\(h_v^{(k)} = \text{MLP}^{(k)}\Big(\big(1+\epsilon^{(k)}\big) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)}\Big)\)
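A sketch of one GIN-style aggregate step on a toy 3-node graph; the single-layer "MLP" and the \(\epsilon\) value are illustrative.

```python
import numpy as np

A = np.array([[0, 1, 1],          # adjacency matrix of a 3-node graph
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.eye(3)                      # initial node features h_v^(0)
eps = 0.1
W = np.random.default_rng(0).normal(size=(3, 3))

def relu(x):
    return np.maximum(x, 0)

# h_v <- MLP((1 + eps) * h_v + sum_{u in N(v)} h_u), vectorized:
H_next = relu(((1 + eps) * H + A @ H) @ W)
print(H_next)
```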
4.4 Spectral-based GNN
** signal processing **
synthesis --> \(A = \sum_k a_k \hat{v}_k\), analysis --> \(a_j = A \cdot \hat{v}_j\), where \(\{\hat{v}_i\}\) is assumed to be an orthonormal basis
considering the signal \(x(t)\) formulated in the time domain
\(x(t) = \int x(\tau)\, \delta(t-\tau)\, d\tau\)
where \(\delta(t-\tau)\) is the basis
for the signal \(x(t)\) in the frequency domain we have
\(x(t) = \frac{1}{2\pi}\int X(j\omega)\, e^{j\omega t}\, d\omega\)
where \(e^{j\omega t}\) is the basis
Fourier transform --> analysis in the frequency domain :
\(X(j\omega) = \int x(t)\, e^{-j\omega t}\, dt\)
** spectral graph theory **
for an undirected graph \(\mathcal{G}=(V,E)\) and \(N=|V|\), define
(1) adjacency matrix \(A \in \mathbb{R}^{N \times N}\) :
\(A_{i,j} = \begin{cases}
0, & e_{i,j} \notin E \\
w(v_i,v_j), & e_{i,j} \in E
\end{cases}
\) , which is symmetric
(2) degree matrix \(D \in \mathbb{R}^{N \times N}\) :
\(
D_{i,j} = \begin{cases}
\sum_k A_{i,k}, & i=j \\
0, & i \ne j
\end{cases}
\) , which is diagonal
(3) signal on graph \(f:V \rightarrow \mathbb{R}^N\)
(4) graph Laplacian \(L=D-A\) , which is positive semi-definite ( WHY ? --> \(f^\top Lf \geq 0\) )
spectral decomposition :
\(L = U \Lambda U^\top, \quad \Lambda = \text{diag}(\lambda_0, \lambda_1, ..., \lambda_{N-1}), \quad U = [\mathbf{u}_0, \mathbf{u}_1, ..., \mathbf{u}_{N-1}]\)
where \(\lambda_i\) is called the frequency and \(\mathbf{u}_i\) is the corresponding basis
operating \(L\) on a graph signal \(f\), we have
\(f^\top L f = \frac{1}{2}\sum_{v_i \in V}\sum_{v_j \in V} w(v_i,v_j)\big(f(v_i) - f(v_j)\big)^2\)
which denotes the "power" of signal variation between vertices
for the basis \(\mathbf{u}_i\) as a graph signal, we have
\(\mathbf{u}_i^\top L \mathbf{u}_i = \lambda_i\)
which shows that a larger frequency corresponds to larger signal variation
graph Fourier transform --> \(\hat{x} = U^\top x\), \(\hat{x}_i = \mathbf{u_i}^\top x\) (seems like vector projection)
inverse graph Fourier transform --> \(x = U \hat{x} = \sum_k \mathbf{u_k} \hat{x}_k\)
filtering --> \(\hat{y} = g_\theta (\Lambda) \hat{x}\)
the total network goes through
\(y = g_\theta(L)\, x = U g_\theta(\Lambda) U^\top x\)
where \(g_\theta(L)\) is the object to learn (the filter)
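A sketch of this spectral filtering pipeline on a toy cycle graph, assuming an illustrative low-pass filter \(g(\lambda) = e^{-\lambda}\).

```python
import numpy as np

A = np.array([[0, 1, 0, 1],        # a 4-node cycle graph
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                           # graph Laplacian

lam, U = np.linalg.eigh(L)          # L = U diag(lam) U^T, lam >= 0
x = np.array([1.0, 0.0, 0.0, 0.0])  # signal on the graph

x_hat = U.T @ x                     # graph Fourier transform
g = np.exp(-lam)                    # low-pass filter g(lam)
y = U @ (g * x_hat)                 # filter, then inverse transform
print(y)
```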
** ChebNet **
use a polynomial to parametrize \(g_\theta(L)\)
\(g_\theta(L) = \sum_{k=0}^{K} \theta_k L^k\)
where the number of parameters to learn is fixed (a degree-\(K\) polynomial) and the filter is \(K\)-localized on the graph
Problem --> \(O(N^2)\) complexity
Solution --> use Chebyshev polynomials as the polynomial kernel, which can be computed recursively
rescale the frequency matrix so the eigenvalues lie in the Chebyshev domain \([-1,1]\)
\(\tilde{\Lambda} = \frac{2\Lambda}{\lambda_{\max}} - I\)
then the object to learn becomes
\(g_\theta(\tilde{\Lambda}) = \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}), \quad T_0(x)=1, \; T_1(x)=x, \; T_k(x)=2xT_{k-1}(x)-T_{k-2}(x)\)
the output goes
\(y = \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x = \sum_{k=0}^{K} \theta_k \bar{x}_k, \quad \bar{x}_k = 2\tilde{L}\bar{x}_{k-1} - \bar{x}_{k-2}\)
where each \(\bar{x}_k\) is computed recursively from the previous two and the total complexity becomes \(O(K|E|)\)
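A sketch of the Chebyshev recursion: each \(\bar{x}_k\) comes from the previous two with one matrix-vector product, so \(T_k(\tilde{L})\) never needs to be formed explicitly. The toy cycle graph and \(\theta\) values are illustrative.

```python
import numpy as np

def cheb_filter(L_tilde, x, theta):
    xbar_prev, xbar = x, L_tilde @ x            # T_0(L~)x, T_1(L~)x
    y = theta[0] * xbar_prev + theta[1] * xbar
    for k in range(2, len(theta)):
        xbar_prev, xbar = xbar, 2 * L_tilde @ xbar - xbar_prev
        y += theta[k] * xbar                    # one mat-vec per order
    return y

A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lmax = np.linalg.eigvalsh(L).max()              # = 4 for this cycle
L_tilde = 2 * L / lmax - np.eye(4)              # eigenvalues now in [-1, 1]

x = np.random.default_rng(0).normal(size=4)
print(cheb_filter(L_tilde, x, theta=[0.5, 0.3, 0.2]))
```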
** GCN (Graph convolutional network) **
normalized graph Laplacian \(L^{\text{norm}} = D^{-\frac{1}{2}}LD^{-\frac{1}{2}} = I_N - D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\)
the output goes (first-order approximation, \(K=1\), \(\lambda_{\max} \approx 2\))
\(y = \theta\big(I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\big)\,x\)
since the eigenvalues of \(I_N+D^{-\frac{1}{2}}AD^{-\frac{1}{2}}\) are in the interval \([0,2]\), repeated application may induce numerical instability or gradient explosion / vanishing, so the renormalization trick is introduced
\(I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}, \quad \tilde{A} = A + I_N\)
where \(\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}\)
the resulting graph convolutional layer is
\(H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big)\)
where \(W^{(l)}\) is the parameter matrix to optimize
which can be rewritten in the per-node perceptron form
\(h_v^{(l+1)} = \sigma\Big(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{\tilde{A}_{vu}}{\sqrt{\tilde{D}_{vv}\tilde{D}_{uu}}}\, W^{(l)\top} h_u^{(l)}\Big)\)
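A sketch of one GCN layer with the renormalization trick on a toy graph; the weights and the ReLU nonlinearity are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    A_tilde = A + np.eye(len(A))                   # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt      # renormalized adjacency
    return np.maximum(A_hat @ H @ W, 0)            # ReLU

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.eye(3)                                      # one-hot node features
W = np.random.default_rng(0).normal(size=(3, 2))   # layer parameters W^(l)
print(gcn_layer(A, H, W))                          # new node features
```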
4.5 Word Embedding
word encoding methods :
(1) one-hot encoding --> ignores relationships between words
(2) word class --> classify words into classes --> ignores relationships between classes
(3) word embedding --> encode each word into a high dimensional space --> distance reflects the relationship between words
word embedding --> an unsupervised process to learn the encoding of each word
5 Sequence to Sequence
5.1 Attention
sophisticated input : inputs are a set of vectors, e.g. text sequence (embedding methods : one-hot encoding, word embedding, ...), voice sequence, graph (social network, molecule)
outputs :
(1) each input has a label (POS tagging)
(2) the whole sequence has a label (sentiment analysis, speaker recognition, molecular properties)
(3) model decides the number of labels itself (translation, speech recognition)
** sequence labeling **
trivial network : fully-connected layers
problem : a label may be influenced by neighboring inputs --> sequence window --> whole sequence (too long for a fixed window) --> self-attention
pseudo network : inputs --> self-attention --> FC layers --> outputs
** relevance \(\alpha\) between inputs **
dot-product : \(\alpha = (W^q a^i) \cdot (W^k a^j)\)
additive : \(\alpha = \mathbf{w}^\top \tanh(W^q a^i + W^k a^j)\)
5.2 Transformer
** algorithms **
(1) query \(q^i = W^q a^i\), key \(k^j = W^k a^j\) --> attention score \(\alpha_{i,j} = q^i \cdot k^j\)
(2) softmax : \(\alpha_{i,j}' = \frac{\exp(\alpha_{i,j})}{\sum_{k}\exp(\alpha_{i,k})}\)
(3) value \(v^i = W^v a^i\) --> \(b^j = \sum_i\alpha_{j,i}'v^i\)
matrix representation :
(1) \(Q = W^q I\), \(K = W^k I\), \(V = W^v I\), \(I = \text{cat}(a_1, a_2, ..., a_n)\)
(2) \(A = K^\top Q\), \(A' = \text{softmax}(A)\) (softmax over each column)
(3) \(O = V A' = V\, \text{softmax}(K^\top Q)\)
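A numpy sketch of single-head self-attention in this matrix form; the \(1/\sqrt{d}\) scaling is the usual Transformer convention, added here even though the notes omit it, and all shapes are illustrative.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n = 4, 6                                        # feature dim, seq length
I = rng.normal(size=(d, n))                        # inputs as columns
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = Wq @ I, Wk @ I, Wv @ I
Attn = softmax(K.T @ Q / np.sqrt(d), axis=0)       # A' = softmax(K^T Q)
O = V @ Attn                                       # one output per column
print(O.shape)                                     # (4, 6)
```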
** multi-head self attention **
(1) for head 1, query \(q^{i,1} = W^{q,1} a^i\), key \(k^{j,1} = W^{k,1} a^j\)
(2) \(\alpha_{i,j,1}' = \text{softmax}(\alpha_{i,j,1}) = \text{softmax}(q^{i,1} \cdot k^{j,1})\)
(3) value \(v^{i,1} = W^{v,1} a^i\) --> \(b^{j,1} = \sum_i\alpha_{j,i,1}'v^{i,1}\)
(4) \(o^j = W^o \times \text{cat}(b^{j,1}, b^{j,2}, ..., b^{j,m})\)
** positional encoding **
each position has a unique positional vector \(e^i\), which is added to the input embedding before self-attention : \(a^i \leftarrow e^i + a^i\)
** self-attention v.s. CNN **
CNN : self-attention that can only attend within a receptive field --> CNN is a simplified self-attention
self-attention : CNN with a learnable receptive field --> self-attention is the complex version of CNN
** decoder **
autoregressive decoder --> outputs the sequence token by token --> uses an END token to stop --> usually performs better
non-autoregressive decoder --> outputs the whole sequence in one step --> uses an END token to cut off the sequence --> parallel
5.3 Self-attention Variants
** domain-knowledge based **
local / truncated attention --> confine attention to a receptive field around each token
stride attention --> confine attention to neighbors that are \(\Delta\) tokens apart, where \(\Delta\) is the stride step
global attention --> add special tokens to the original sequence; they attend to every token (to collect global information) and are attended by every token (to distribute global information) --> no attention between non-special tokens
clustering --> cluster queries and keys --> confine attention to pairs within the same cluster
different attention choices ? --> use all in different heads (I WANT ALL !!!)
** learning based **
sinkhorn sorting network --> learn a network that transforms the input into a pre-attention matrix, which is then converted into a binary attention
switch matrix (yes/no attention)
input sequence \(\mathbf{x} \in \mathbb{R}^N\) --(network)--> pre-attention matrix \(M^p \in \mathbb{R}^{N \times N}\) --(operation trick)--> \(M^s \in \{0,1\}^{N \times N}\)
where the operation trick is a differentiable transformation
to reduce the complexity of the trained network, the sequence is usually split into subsequences (blocks) that each share one pre-attention column, rather than producing one column per token
synthesizer --> attention matrix as network parameter to learn
** matrix multiplication acceleration **
Linformer --> notices that the attention matrix is low-rank
assume the query / key dimension is \(t\), the value dimension is \(t'\), and the number of tokens is \(N\)
\(N\) keys --> \(n\) representative keys --> query matrix \(Q_{t \times N}\), key matrix \(K_{t \times n}\) --> attention matrix \(A_{n \times N}\)
\(N\) values --> \(n\) representative values --> value matrix \(V_{t' \times n}\) --> output matrix \(O_{t' \times N}\)
Question : why not reduce the number of queries ? --> the number of queries equals the output length
linear transformer / performer
ignoring the softmax, the self-attention network goes
\(O = V(K^\top Q)\)
two multiplication orders : (1) \(V(K^\top Q)\), (2) \((VK^\top)Q\)
times of multiplication : (1) \((t+t')N^2\), (2) \(2tt'N\)
since usually \(N \gg t, t'\), the second order costs much less than the first
put the softmax back --> assume that \(\exp(q \cdot k) \approx \phi(q) \cdot \phi(k)\), then
\(b^j = \frac{\sum_i \exp(q^j \cdot k^i)\, v^i}{\sum_i \exp(q^j \cdot k^i)} \approx \frac{\big(\sum_i v^i\, \phi(k^i)^\top\big)\, \phi(q^j)}{\big(\sum_i \phi(k^i)\big)^\top \phi(q^j)}\)
where the bracketed sums over \(i\) are identical for every \(j\) and can be computed once
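A sketch of this linear-attention trick, assuming the simple \(\text{elu}(x)+1\) feature map used by some linear transformers; the shared sums over keys and values are computed once and reused for every query, so the cost is linear in \(N\).

```python
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))     # elu(x) + 1 > 0

rng = np.random.default_rng(0)
d, dv, N = 4, 3, 8
Q, K = rng.normal(size=(d, N)), rng.normal(size=(d, N))
V = rng.normal(size=(dv, N))

S = phi(K) @ V.T          # sum_i phi(k^i) (v^i)^T, shape (d, dv) -- shared
z = phi(K).sum(axis=1)    # sum_i phi(k^i), shape (d,)            -- shared
# each output b^j reuses S and z
B = (S.T @ phi(Q)) / (z @ phi(Q))   # numerator (dv, N) / denominator (N,)
print(B.shape)                       # (3, 8)
```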
5.4 Non-autoregressive Sequence Generation
autoregressive model --> sequence generation token by token --> generation time is proportional to the sequence length
non-autoregressive model --> each token is generated without depending on the other output tokens --> multi-modality problem (the output mixes many valid output modalities)
** Vanilla NAT **
predict fertility (how many output words each input word expands to) as a latent variable & copy input words accordingly --> represents a sentence-level "plan" before decoding
sequence-level knowledge distillation
teacher : autoregressive model --> student : non-autoregressive model
construct a new corpus with the teacher --> the teacher's greedy-decoded output serves as the student's training data
noisy parallel decoding
sample several fertility sequences --> generate several candidate sequences --> score them with an autoregressive model to find the best sequence
