Deep Learning Week 2 Notes

Loss and Risk

1. Classification & Regression

\(\text{I. Classification (e.g. }{\bf object\ recognition,\ cancer\ detection,\ speech\ processing,\ ...})\)

\(\text{II. Regression (e.g. }{\bf customer\ satisfaction,\ stock\ prediction,\ epidemiology})\)

\(\text{III. Density Estimation (e.g. }{\bf outlier\ detection,\ data\ visualization,\ sampling,\ synthesis})\)

\(\text{For a classification task, an intuitive interpretation is:}\)

\[\begin{align} \mu_{X,Y}(x,y) = \mu_{X|Y=y}(x)P(Y=y) \end{align} \]

\(\large \text{That is: draw } Y\text{ first, and given the value }y,\text{ generate }X.\)
\(\\\)
\(\large\text{The conditional distribution }\mu_{X|Y=y}\text{ is }{\bf the\ distribution\ of\ the\ observed\ signals\ for\ the\ class\ }y.\)

\(\\\)
\(\text{For a regression task, one would interpret it as:}\)

\[\begin{align} \mu_{X,Y}(x,y) = \mu_{Y|X=x}(y)\mu_X(x) \end{align} \]

\(\large \text{i.e.: first generate }X,\text{ then given the value }x,\text{ generate }Y.\)
\(\text{In the simplest case, with an additive noise term }\epsilon:\)

\[\begin{align} Y = f(X)+\epsilon \end{align} \]
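As a toy illustration of this generative view (not from the notes: the choice f(x) = sin(x), the uniform \(\mu_X\), and the noise level are arbitrary assumptions), one can sample such pairs directly:

import torch

# Toy sketch (assumptions: f(x) = sin(x), X uniform on [-3, 3], Gaussian noise)
N = 100
X = torch.empty(N).uniform_(-3, 3)        # first generate X ~ mu_X
epsilon = torch.empty(N).normal_(0, 0.1)  # additive noise term
Y = torch.sin(X) + epsilon                # then generate Y = f(X) + eps given X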

\(\\\)

2. Loss

\(\text{We are looking for }f\text{ with small expected risk:}\)

\[\begin{align} R(f) = \mathbb{E}_Z[l(f,Z)] \end{align} \]

\(\text{We cannot compute this expectation, since }\mu_Z\text{ is unknown, but we can compute an empirical estimate on a training set }\mathcal{D}=(Z_1,\dots,Z_N):\)

\[\begin{align} \hat{R}(f;\mathcal{D}) = \hat{\mathbb{E}}_{\mathcal{D}}[l(f,Z)]=\frac{1}{N} \sum_{n=1}^Nl(f,Z_n) \end{align} \]
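As a minimal sketch of this estimate (my own illustration; the regression setting, the squared loss, and the toy data are assumptions), the empirical risk is just an average of per-sample losses:

import torch

def empirical_risk(f, loss, x, y):
    # \hat{R}(f; D) = (1/N) * sum_n l(f, z_n), with z_n = (x_n, y_n)
    return loss(f(x), y).mean()

# Toy usage with the squared loss on random data (illustration only)
x, y = torch.randn(100), torch.randn(100)
f = lambda t: 2 * t + 1
risk = empirical_risk(f, lambda pred, target: (pred - target) ** 2, x, y)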

\(\\\)

3. K-NN

\(\text{Under mild regularity assumptions on }\mu_{X,Y},\text{ for }N\rightarrow \infty\text{ the asymptotic error rate of the 1-NN classifier is less than twice the (optimal!) Bayes error rate.}\)

\(\large \text{It can be shown that when }N \rightarrow \infty\text{ and }K\text{ grows roughly like }\sqrt{N}\text{ (i.e. slower than }N\text{), the asymptotic error rate of }K\text{-NN reaches the optimal Bayes error: we average over more and more samples, and these samples are more and more geometrically localized.}\)
\(\\\)
\(\bf In\ detail:\)

\[\begin{align} \mathbb{E}_{S_{train}}[L(g_{S_{train}})]&\leq 2L(g_*)+4c\sqrt{d}N^{-\frac{1}{d+1}} \end{align} \]

\(\text{where the Bayes risk is }L(g_*) = P(g_*(X)\neq Y) = \mathbb{E}[\min(\eta(X),1-\eta(X))]\)

\[g_*(X) = 1_{\eta(X)\geq 1/2},\eta(X) = P(Y=1|X) \]

\(\\\)
\(\large\bf Interpretation:\)
\(\textbf{I. Fixed }d,\ N\rightarrow \infty:\ \mathbb{E}_{S_{train}}[L(g_{S_{train}})]\leq 2L(g_*)\)
\(\textbf{II. Fixed }N,\ d\rightarrow \infty:\textbf{ the second term blows up; keeping it small requires }N\textbf{ to grow exponentially with }d\textbf{ (curse of dimensionality).}\)
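A minimal 1-NN classifier sketch in PyTorch (my own illustration, not course code; the two-blob toy data is an assumption): for each test point, it predicts the label of the closest training point in Euclidean distance.

import torch

def one_nn_predict(x_train, y_train, x_test):
    # Pairwise Euclidean distances between test and train points: (N_test, N_train)
    distances = torch.cdist(x_test, x_train)
    # Index of the nearest training sample for every test sample
    nearest = distances.argmin(dim=1)
    return y_train[nearest]

# Toy usage: two Gaussian blobs labeled 0 and 1
x_train = torch.cat([torch.randn(50, 2) - 2, torch.randn(50, 2) + 2])
y_train = torch.cat([torch.zeros(50, dtype=torch.long), torch.ones(50, dtype=torch.long)])
y_pred = one_nn_predict(x_train, y_train, torch.randn(10, 2))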

\(\\\)

4. Polynomials

\[\begin{align} f(x;\alpha) = \sum_{d=0}^D\alpha_dx^d \end{align} \]

\(\text{PyTorch Code:}\)

import torch

def fit_polynomial(D, x, y):
    # Design matrix X[n, d] = x_n^d built by broadcasting
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    # Least-squares solution of X @ alpha = y
    return torch.linalg.lstsq(X, y).solution

n[:, None] inserts a new dimension at dim=1; it is equivalent to n.unsqueeze(dim=1).

>>> n = torch.rand(3, 100, 100)
>>> n[:, None].shape
torch.Size([3, 1, 100, 100])
>>> n.unsqueeze(1).shape
torch.Size([3, 1, 100, 100])

\(\Large\textbf{For more details, refer to }\) Stack-Overflow

\(\\\)

Bias-Variance dilemma

\(\text{When the capacity increases, or the regularization decreases, the mean of the predicted values gets on target (lower bias), but the prediction varies more across runs (higher variance).}\)

\(\text{Given the trained models }f_1,...,f_M,\text{ the empirical mean prediction:}\)

\[\begin{align} \bar{f}(x) = \frac{1}{M}\sum_{m=1}^Mf_m(x) \end{align} \]

\(\text{and the empirical variance:}\)

\[\begin{align} \sigma^2(x) = \frac{1}{M-1}\sum_{m=1}^M[f_m(x)-\bar{f}(x)]^2 \end{align} \]

\(\text{We have:}\)

\[\begin{align} \mathbb{E}[(Y-y)^2]&=\mathbb{E}[Y^2-2Yy+y^2]\\ &=\mathbb{E}(Y^2)-2\mathbb{E}(Y)y+y^2\\ &=\mathbb{E}(Y^2)-\mathbb{E}(Y)^2+\mathbb{E}(Y)^2-2\mathbb{E}(Y)y+y^2\\ &=Var(Y)+[\mathbb{E}(Y)-y]^2 \end{align} \]

\(\text{The first term is the } \textbf{variance}, \text{ the second is the squared }\textbf{bias}.\)
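The empirical mean and variance above can be computed directly from a collection of trained models. A minimal sketch (my own illustration; it assumes the trained models \(f_1,\dots,f_M\) are plain Python callables mapping a batch of inputs to a tensor of predictions):

import torch

def mean_and_variance(models, x):
    # Stack the predictions of the M trained models at the points x: shape (M, N)
    preds = torch.stack([f(x) for f in models])
    f_bar = preds.mean(dim=0)  # \bar{f}(x)
    var = preds.var(dim=0)     # default is the unbiased 1/(M-1) estimator
    return f_bar, var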
\(\\\)

1. All Probs?

\(\text{Conceptually, model fitting and regularization can be interpreted as Bayesian inference.}\)
\(\text{Model the parameters }A \text{ of the model as following a prior distribution }\mu_A.\)

\(\large\text{By looking at the data }\mathcal{D}, \text{ we can estimate the posterior distribution:}\)

\[\begin{align} \mu_A(\alpha\mid\mathcal{D}=d)\propto \mu_{\mathcal{D}}(d\mid A=\alpha)\,\mu_A(\alpha) \end{align} \]

\(\text{Example: a polynomial model with a Gaussian prior:}\)

\[\begin{align} Y_n = \sum_{d=0}^DA_dX_n^d+\Delta_n \end{align} \]

\(\text{where}\)

\[\begin{align} A_d\sim \mathcal{N}(0,\xi),X_n\sim\mu_X,\Delta_n\sim \mathcal{N}(0,\sigma) \end{align} \]

\(\text{In detail:}\)

\[\begin{align} \log{\mu_A(\alpha\mid\mathcal{D}=d)} &= \log{\frac{\mu_{\mathcal{D}}(d\mid A=\alpha)\,\mu_A(\alpha)}{\mu_{\mathcal{D}}(d)}}\\ &=\log{\prod_n\mu(x_n,y_n\mid A)}+\log{\mu_A(\alpha)}-\log{Z}\\ &=\log{\prod_n \mu(y_n\mid x_n,A)\,\mu(x_n\mid A)}+\log{\mu_A(\alpha)}-\log{Z}\\ &=\log{\prod_n \mu(y_n\mid x_n,A)}+\log{\mu_A(\alpha)}-\log{Z'}\\ &=-\frac{1}{2\sigma^2}\sum_n\Big(y_n-\sum_d\alpha_dx_n^d\Big)^2-\frac{1}{2\xi^2}\sum_d\alpha_d^2-\log{Z''} \end{align} \]
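Maximizing this log-posterior over \(\alpha\) is the same as minimizing the squared error plus an L2 penalty with weight \(\lambda=\sigma^2/\xi^2\), i.e. ridge regression. A closed-form sketch of the resulting MAP estimate (my own illustration, reusing the design matrix from fit_polynomial above; sigma and xi are the noise and prior scales):

import torch

def fit_polynomial_map(D, x, y, sigma, xi):
    # Same polynomial design matrix as in fit_polynomial above
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    # MAP estimate = ridge regression: alpha = (X^T X + lam I)^{-1} X^T y,
    # with lam = sigma^2 / xi^2
    lam = (sigma / xi) ** 2
    return torch.linalg.solve(X.t() @ X + lam * torch.eye(D + 1), X.t() @ y)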

\(\\\)

Clustering and Embeddings

\(\\\)

1. K-means

\[\begin{align} \arg \min_{c_1,...c_K}\sum_n\min_k||x_n-c_k||^2 \end{align} \]

\(\text{First initialize }c_1^0,\dots,c_K^0\text{ randomly, then repeat the following two steps until convergence:}\)

\[\begin{align} \forall n,\ k_n^t &=\arg\min_k\|x_n-c_k^t\|\\ \forall k,\ c_{k}^{t+1}&=\frac{1}{|\{n:k_n^t=k\}|}\sum_{n:k_n^t=k}x_n \end{align} \]
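A compact PyTorch sketch of these two alternating steps (my own illustration, not course code; the fixed iteration count stands in for a proper convergence test):

import torch

def k_means(x, K, n_iter=100):
    # Initialize the centroids with K randomly chosen data points
    c = x[torch.randperm(x.size(0))[:K]].clone()
    for _ in range(n_iter):
        # Assignment step: k_n = argmin_k ||x_n - c_k||
        assignments = torch.cdist(x, c).argmin(dim=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        for k in range(K):
            mask = assignments == k
            if mask.any():
                c[k] = x[mask].mean(dim=0)
    return c, assignments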

\(\\\)

2. PCA

PCA learns an \(\textbf{orthogonal, linear}\) transformation of the data that projects an input \(x\) to a representation \(z\).

\(\text{Given data points }x_n\in\mathbb{R}^D,\ n=1,\dots,N\)

\((A):\text{Compute the average and center the data:}\)

\[\begin{align} \bar{x} &= \frac{1}{N}\sum_nx_n\\ \forall n, x_n^{(0)} &= x_n-\bar{x} \end{align} \]

\((B):\text{ For }t=1,\dots,D,\text{ pick the direction of largest variance and remove that component from the data:}\)

\[\begin{align} v_t = \arg\max_{||v||=1}\sum_n(v\cdot x_n^{(t-1)})^2\\ \forall n, x_n^{(t)} = x_n^{(t-1)}-(v_t\cdot x_n^{(t-1)})v_t \end{align} \]

\(\large \text{A standard way to compute the PCA basis relies on the }\textbf{eigen-decomposition}\text{. Let}\)

\[X = \begin{pmatrix} ---\ x_1\ ---\\ ---\ x_2\ ---\\ \vdots\\ ---\ x_N\ --- \end{pmatrix} \]

\(\text{denote the matrix of centered data (one sample per row). We have:}\)

\[\begin{align} \sum_n(v\cdot x_n)^2 &=\left\| \begin{pmatrix} v\cdot x_1\\ .\\ .\\ v\cdot x_N \end{pmatrix}\right\|^2\\ &=\|vX^T\|^2\\ &=(vX^T)(vX^T)^T\\ &= v(X^TX)v^T \end{align} \]

\(\large\text{From this we can derive that }v_1, v_2, \dots , v_D\text{ are the eigenvectors of } X^TX, \text{ ranked according to the absolute values of their eigenvalues.}\)
\(\\\)
\(\large\textbf{In practice, to compute the PCA basis (a code sketch follows the list):}\)

  • \(\text{ Center the data by subtracting the mean}\)
  • \(\textbf{ Compute the eigen-decomposition of }X^TX \text{ where }X \text{ is the matrix of the }\textbf{row samples}\)
  • \(\text{ Rank the eigen-vectors according to the absolute values of eigenvalues}\)
  • \(v_1:\text{ is the first vector of PCA basis, }v_2\text{ is the second, etc.}\)
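A minimal PyTorch sketch of these steps (my own illustration; it assumes the row samples are stored in a 2-D tensor x):

import torch

def pca_basis(x):
    # x: (N, D) matrix of row samples
    # Center the data by subtracting the mean
    x_centered = x - x.mean(dim=0)
    # Eigen-decomposition of X^T X (symmetric, so eigh applies)
    eigenvalues, eigenvectors = torch.linalg.eigh(x_centered.t() @ x_centered)
    # Rank the eigenvectors by decreasing absolute eigenvalue
    order = eigenvalues.abs().argsort(descending=True)
    # Column t of the result is v_{t+1} of the PCA basis
    return eigenvectors[:, order]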
