Deep Learning Week 2 Notes
Loss and Risk
1. Classification & Regression
\(\text{I. Classification (e.g. }{\bf{object\ recognition, cancer\ detection, speech\ processing...}})\)
\(\text{II. Regression (e.g. }{\bf customer\ satisfaction, stock\ prediction, epidemiology})\)
\(\text{III. Density Estimation (e.g. }{\bf outlier\ detection, data\ visualization, sampling,\ synthesis} )\)
\(\text{For a classification task, an often intuitive interpretation is:}\)
\(\large \text{That is: draw } Y\text{ first, and given the value }y,\text{ generate }X.\)
\(\\\)
\(\large\text{The conditional distribution }\mu_{X|Y=y}\text{ means }{\bf the\ distribution\ of\ the\ observed\ signals\ for\ the\ class\ } y.\)
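Concretely, this generative view corresponds to the factorization (treating the \(\mu\)'s loosely as densities):

\(\large \mu_{X,Y}(x,y) \;=\; \mu_Y(y)\ \mu_{X|Y=y}(x)\)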
\(\\\)
\(\text{For a regression task, one would interpret it as:}\)
\(\large \text{i.e.: first generate }X,\text{ then given the value }x,\text{ generate }Y.\)
\(\text{In the simple case:}\)
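Correspondingly, the regression view factorizes the joint the other way; treating the \(\mu\)'s loosely as densities, and taking the simple case to be additive Gaussian noise (an assumption here):

\(\large \mu_{X,Y}(x,y) \;=\; \mu_X(x)\ \mu_{Y|X=x}(y),\qquad Y = f(X) + \epsilon,\ \ \epsilon\sim\mathcal{N}(0,\sigma^2)\)

so that \(\mu_{Y|X=x} = \mathcal{N}(f(x),\sigma^2)\).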
\(\\\)
2. Loss
\(\text{We are looking for }f\text{ with small expected risk:}\)
\(\text{But we can compute an empirical estimate of it:}\)
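Written out with a generic loss \(\ell\) (e.g. squared error for regression, 0-1 error for classification) and a training set \(\mathcal{D} = \{(x_n,y_n)\}_{n=1}^N\) (notation assumed here):

\(\large \text{Expected risk: } R(f) = \mathbb{E}_{(X,Y)\sim\mu_{X,Y}}\big[\ell(f(X),Y)\big]\)

\(\large \text{Empirical risk: } \hat{R}(f) = \frac{1}{N}\sum_{n=1}^{N}\ell\big(f(x_n),y_n\big)\)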
\(\\\)
3. K-NN
\(\text{Under mild regularity assumptions on }\mu_{X,Y},\text{ for } N\rightarrow \infty\text{ the asymptotic error rate of the 1-NN is less than twice the (optimal!) Bayes’ error rate.}\)
\(\large \text{It can be shown that when }N \rightarrow \infty\text{ and }K\text{ grows roughly as the square root of }N\text{ (i.e. grows more slowly than }N\text{), the asymptotic error rate reaches the optimal Bayes’ error, because we look at more and more samples, but they are more and more geometrically localized.}\)
\(\\\)
\(\bf In\ detail:\)
\(\text{where the Bayes risk is: }L(g_*) = P(g_*(X)\neq Y) = \mathbb{E}[\min(\eta(X),1-\eta(X))]\)
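One standard way to state the detailed result (Cover & Hart, binary case), with \(\eta(x) = P(Y=1\mid X=x)\) and \(g_{S_{train}}\) the 1-NN rule built on the training set:

\(\large L(g_*) \;\le\; \lim_{N\rightarrow\infty}\mathbb{E}_{S_{train}}\big[L(g_{S_{train}})\big] \;=\; \mathbb{E}\big[2\eta(X)(1-\eta(X))\big] \;\le\; 2L(g_*)\big(1-L(g_*)\big) \;\le\; 2L(g_*)\)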
\(\\\)
\(\large\bf Interpretation:\)
\(\textbf{I. Fixed } d,\ N\rightarrow \infty:\ \mathbb{E}_{S_{train}}[L(g_{S_{train}})]\leq 2L(g_*)\)
\(\textbf{II. Fixed }N,\ d\rightarrow \infty: \textbf{Error increases exponentially fast.}\)
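A minimal K-NN classifier sketch in PyTorch (hypothetical tensors `train_x`, `train_y`, `test_x`; `k=1` gives the 1-NN rule discussed above):

```python
import torch

def knn_predict(train_x, train_y, test_x, k=1):
    # Pairwise Euclidean distances between test and train points: (n_test, n_train)
    dists = torch.cdist(test_x, train_x)
    # Indices of the k nearest training points for each test point
    _, idx = dists.topk(k, dim=1, largest=False)
    # Majority vote among the k neighbours' labels
    votes = train_y[idx]                      # (n_test, k)
    return votes.mode(dim=1).values
```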
\(\\\)
4. Polynomials
\(\text{PyTorch Code:}\)
```python
import torch

def fit_polynomial(D, x, y):
    # Broadcasting magic: build the (N, D+1) design matrix [x_n^d]
    X = x[:, None] ** torch.arange(0, D + 1)[None]
    # Least-squares solution of X a = y
    return torch.linalg.lstsq(X, y).solution
```
`n[:, None]` has the effect of inserting a new dimension at `dim=1`; it is equivalent to `n.unsqueeze(dim=1)`:

```python
>>> n = torch.rand(3, 100, 100)
>>> n[:, None].shape
torch.Size([3, 1, 100, 100])
>>> n.unsqueeze(1).shape
torch.Size([3, 1, 100, 100])
```
\(\Large\textbf{For more details, refer to }\) Stack-Overflow
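A quick usage sketch of `fit_polynomial` on hypothetical data (the target and noise level are made up for illustration):

```python
import torch

torch.manual_seed(0)
x = torch.linspace(-1, 1, 100)
y = x**3 - 0.5 * x + 0.1 * torch.randn_like(x)   # noisy samples of a cubic

alpha = fit_polynomial(3, x, y)                   # coefficients a_0, ..., a_3
X = x[:, None] ** torch.arange(0, 4)[None]        # same design matrix as in fit_polynomial
y_hat = X @ alpha                                 # fitted values
mse = ((y_hat.squeeze() - y) ** 2).mean()
```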
\(\\\)
Bias-Variance dilemma
\(\text{When the capacity increases, or the regularization decreases, the mean of the predicted value gets right on target, but the prediction varies more across runs.}\)
\(\text{Given the trained models }f_1,\dots,f_M,\text{ the empirical mean prediction at a point }x\text{ is:}\)
\(\large \bar{f}(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x)\)
\(\text{and the empirical variance:}\)
\(\large \frac{1}{M}\sum_{m=1}^{M}\big(f_m(x)-\bar{f}(x)\big)^2\)
\(\text{We have, for a target value }y:\)
\(\large \frac{1}{M}\sum_{m=1}^{M}\big(f_m(x)-y\big)^2 = \frac{1}{M}\sum_{m=1}^{M}\big(f_m(x)-\bar{f}(x)\big)^2 + \big(\bar{f}(x)-y\big)^2\)
\(\text{The first term is the }\textbf{variance}\text{, the second the (squared) }\textbf{bias}.\)
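The quantities above can be estimated empirically, e.g. with the `fit_polynomial` from the previous section (a sketch on made-up data; the target \(\sin(3x)\), the noise level, and \(M\) are arbitrary choices):

```python
import torch

torch.manual_seed(0)
M, N, D = 100, 20, 5
x = torch.linspace(-1, 1, N)
x_test = torch.linspace(-1, 1, 200)
X_test = x_test[:, None] ** torch.arange(0, D + 1)[None]

preds = []
for _ in range(M):
    # Fresh noisy training set drawn around the same target each run
    y = torch.sin(3 * x) + 0.2 * torch.randn_like(x)
    a = fit_polynomial(D, x, y)
    preds.append((X_test @ a).squeeze())

preds = torch.stack(preds)                        # (M, 200) predictions across runs
mean_pred = preds.mean(dim=0)                     # empirical mean prediction
var_pred = preds.var(dim=0)                       # empirical variance across runs
bias2 = (mean_pred - torch.sin(3 * x_test)) ** 2  # squared bias w.r.t. the target
```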
\(\\\)
1. All Probs?
\(\text{Conceptually, model fitting and regularization can be interpreted as Bayesian inference.}\)
\(\text{Model the parameters }A\text{ of the model as following a prior distribution }\mu_A.\)
\(\large\text{By looking at the data }\mathcal{D}, \text{ we can estimate the posterior distribution:}\)
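Writing the \(\mu\)'s loosely as densities, this is Bayes' rule:

\(\large \mu_{A\mid\mathcal{D}}(a) \;=\; \frac{\mu_{\mathcal{D}\mid A=a}(\mathcal{D})\ \mu_A(a)}{\mu_{\mathcal{D}}(\mathcal{D})} \;\propto\; \mu_{\mathcal{D}\mid A=a}(\mathcal{D})\ \mu_A(a)\)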
\(\text{Example: a polynomial with Gaussian Prior:}\)
\(\text{where}\)
\(\text{In detail:}\)
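A hedged sketch of how this example typically works out (assuming Gaussian observation noise of variance \(\sigma^2\) and an isotropic Gaussian prior \(\mathcal{N}(0,\sigma_A^2 I)\) on the coefficient vector \(a\)): maximizing the posterior (the MAP estimate) amounts to penalized least squares,

\(\large \hat{a}_{MAP} \;=\; \arg\min_a \sum_{n}\big(y_n - f(x_n;a)\big)^2 + \lambda\|a\|^2,\qquad \lambda = \sigma^2/\sigma_A^2\)

i.e. the Gaussian prior corresponds to an L2 (ridge) regularizer.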
\(\\\)
Clustering and Embeddings
\(\\\)
1. K-means
\(\text{First, initialize }c_1^0,\dots,c_K^0\text{ randomly, then repeat the following two steps until convergence:}\)
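The two alternating steps are, in the standard formulation (notation assumed here):

\(\large \text{(assignment) } \forall n,\ k_n^{t} = \arg\min_k \|x_n - c_k^{t}\|^2\)

\(\large \text{(update) } \forall k,\ c_k^{t+1} = \frac{1}{|\{n:\,k_n^{t}=k\}|}\sum_{n:\,k_n^{t}=k} x_n\)

A minimal PyTorch sketch of the same loop (random initialization from the data; hypothetical input `x` of shape `(N, D)`):

```python
import torch

def kmeans(x, K, n_iter=50):
    # Initialize the centroids with K points sampled from the data
    c = x[torch.randperm(x.size(0))[:K]].clone()
    for _ in range(n_iter):
        # Assignment step: closest centroid for every point
        assign = torch.cdist(x, c).argmin(dim=1)          # (N,)
        # Update step: move each centroid to the mean of its points
        for k in range(K):
            if (assign == k).any():
                c[k] = x[assign == k].mean(dim=0)
    return c, assign
```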
\(\\\)
2. PCA
PCA learns an \(\textbf{orthogonal, linear}\) transformation of the data that projects an input \(x\) to a representation \(z\).
\(\text{Given data points: }x_n\in\mathbb{R}^D,\ n=1,\dots,N\)
\((A):\text{ Compute the average and center the data:}\)
\((B):\text{ for }t=1,\dots,D,\text{ pick the direction and project the data:}\)
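Explicitly, a standard greedy formulation of these two steps (notation assumed here):

\((A):\ \large \bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n,\qquad x_n \leftarrow x_n - \bar{x}\)

\((B):\ \large v_t = \underset{\|v\|=1}{\arg\max}\ \sum_{n=1}^{N}(v\cdot x_n)^2,\qquad x_n \leftarrow x_n - (v_t\cdot x_n)\,v_t\)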
\(\large \text{A standard way to compute the PCA basis relies on an }\textbf{eigen-decomposition}.\)
\(\text{Let }X\text{ denote the }N\times D\text{ matrix whose rows are the centered data points. For any }v\text{ we have:}\)
\(\large \sum_{n=1}^{N}(v\cdot x_n)^2 = \|Xv\|^2 = v^\top X^\top X v\)
\(\large\text{From this we can derive that }v_1, v_2, \dots, v_D\text{ are the eigenvectors of } X^\top X \text{ ranked according to the absolute values of their eigenvalues.}\)
\(\\\)
\(\large\textbf{In practice: to compute the PCA basis}\) (see the code sketch after this list)
- \(\text{ Center the data by subtracting the mean}\)
- \(\textbf{ Compute the eigen-decomposition of }X^TX \text{ where }X \text{ is the matrix of the }\textbf{row samples}\)
- \(\text{ Rank the eigen-vectors according to the absolute values of eigenvalues}\)
- \(v_1:\text{ is the first vector of PCA basis, }v_2\text{ is the second, etc.}\)
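A minimal PyTorch sketch of these steps (the function name and the random data are hypothetical):

```python
import torch

def pca_basis(x):
    # x: (N, D) matrix of row samples
    x = x - x.mean(dim=0)                       # center the data
    # Eigen-decomposition of X^T X (symmetric, so eigh applies)
    evals, evecs = torch.linalg.eigh(x.T @ x)
    # Rank the eigenvectors by the absolute values of their eigenvalues
    order = evals.abs().argsort(descending=True)
    return evecs[:, order]                      # columns are v_1, v_2, ..., v_D

# Hypothetical usage: project onto the first two principal directions
x = torch.randn(1000, 10)
V = pca_basis(x)
z = (x - x.mean(dim=0)) @ V[:, :2]              # (1000, 2) representation
```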