Some Basic Concepts in Machine Learning
Model measurement
Error
Training error, Empirical error, Generalization error
Fitting
Overfitting, underfitting
Evaluation methods
- Hold-out: \(D=S \cup T, ~S\cap T=\emptyset\), where \(S\) is the training set and \(T\) is the test set. Use stratified sampling so both sets preserve the class distribution of \(D\).
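A minimal Python sketch of a stratified hold-out split (the helper name `stratified_holdout` and the `test_frac` parameter are my own, not from the notes):

```python
import random

def stratified_holdout(D, test_frac=0.3, seed=0):
    """Split labelled samples D = [(x, y), ...] into training set S and
    test set T, keeping each class's proportion roughly equal in both."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in D:
        by_class.setdefault(y, []).append((x, y))
    S, T = [], []
    for cases in by_class.values():          # stratify: split each class separately
        rng.shuffle(cases)
        cut = int(round(len(cases) * test_frac))
        T.extend(cases[:cut])
        S.extend(cases[cut:])
    return S, T
```

With 10 samples per class and `test_frac=0.3`, each class contributes 3 cases to \(T\) and 7 to \(S\), so the class ratio is the same in both sets.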
- Cross-validation: \(D=\bigcup_k D_k,~D_i\cap D_j=\emptyset ~ (i\neq j)\); each round trains on \(k-1\) folds and tests on the remaining one (k-fold cross-validation). When \(k=\#D\), it is called leave-one-out (LOO).
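A sketch of the k-fold partition and splits (function names are my own; folds are formed by simple striding rather than shuffling):

```python
def kfold_indices(m, k):
    """Partition indices 0..m-1 into k disjoint folds D_1..D_k."""
    return [list(range(i, m, k)) for i in range(k)]

def kfold_splits(m, k):
    """Yield (train, test) index lists: each round one fold is the
    test set and the other k-1 folds form the training set."""
    folds = kfold_indices(m, k)
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Setting `k = m` makes every test fold a single case, i.e. leave-one-out.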
- Bootstrapping: given a sample set \(D\) with \(\#D=m\), sample uniformly from \(D\) with replacement, copying one case into \(D'\) each time; repeat \(m\) times to obtain a new set \(D'\) with \(\#D' = m\). Use \(D'\) as the training set and \(D \setminus D'\) as the test set. About \((1-1/m)^m \to 1/e \approx 36.8\%\) of \(D\) never enters \(D'\), which gives the out-of-bag estimate.
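A sketch of one bootstrap draw with its out-of-bag set (the helper name `bootstrap_split` is my own):

```python
import random

def bootstrap_split(D, seed=0):
    """Draw m samples from D uniformly WITH replacement to form the
    training set D'; cases never drawn form the out-of-bag test set."""
    rng = random.Random(seed)
    m = len(D)
    picked = [rng.randrange(m) for _ in range(m)]
    train = [D[i] for i in picked]                 # #D' == m, duplicates allowed
    oob = [D[i] for i in range(m) if i not in set(picked)]
    return train, oob
```

For moderately large \(m\), the out-of-bag set holds roughly \(36.8\%\) of the original cases.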
- Parameter tuning: pick a range and step size for each candidate parameter, and compare settings on a separate validation set.
Performance measures:
- Samples \(D=\{(x_1,y_1),\dots,(x_m,y_m)\}\), where \(y_i\) is the true label of \(x_i\); a learner \(f\) outputs the prediction \(f(x)\).
- Regression: mean squared error (MSE).
\[E(f;D)=\frac{1}{m}\sum_{i=1}^m (f(x_i)-y_i)^2 ~. \]
For a data distribution \(\mathcal{D}\) with density \(p(\cdot)\),
\[E(f;\mathcal{D})=\int_{x \sim \mathcal{D}} (f(x)-y)^2 p(x)\,dx ~. \]
- Error rate:
\[E(f;D)=\frac{1}{m} \sum_{i=1}^m \mathbb{I}(f(x_i)\neq y_i) ~. \]
- Accuracy:
\[acc(f; D)=\frac{1}{m}\sum_{i=1}^m\mathbb{I}(f(x_i) = y_i) = 1-E(f; D) ~. \]
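The three error measures above, as direct Python translations of the formulas (function names are my own):

```python
def mse(f, D):
    """Mean squared error: E(f;D) = (1/m) * sum((f(x_i) - y_i)^2)."""
    return sum((f(x) - y) ** 2 for x, y in D) / len(D)

def error_rate(f, D):
    """Fraction of misclassified cases: (1/m) * sum(I(f(x_i) != y_i))."""
    return sum(f(x) != y for x, y in D) / len(D)

def accuracy(f, D):
    """acc(f;D) = 1 - E(f;D)."""
    return 1 - error_rate(f, D)
```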
- Precision and recall (from the confusion-matrix counts TP, FP, TN, FN):
\[P=\frac{TP}{TP+FP}\\ R=\frac{TP}{TP+FN} \]
- P-R curve
- Break-Even Point (BEP): the point on the P-R curve where \(P=R\)
- \(F_\beta\) measure (\(\beta>1\) weights recall more, \(\beta<1\) precision; \(\beta=1\) gives \(F_1\)):
\[\frac{1}{F_\beta} = \frac{1}{1+\beta^2}\left( \frac{1}{P}+\frac{\beta^2}{R} \right) \]
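Solving the harmonic-mean identity above for \(F_\beta\) gives the closed form used below (the function name is my own):

```python
def f_beta(p, r, beta=1.0):
    """F_beta from precision p and recall r.
    1/F_beta = (1/(1+beta^2)) * (1/p + beta^2/r)
    rearranges to F_beta = (1+beta^2)*p*r / (beta^2*p + r)."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

With \(p=r\) the measure reduces to that common value for any \(\beta\).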
- macro-P, macro-R, macro-\(F_\beta\): average the per-matrix precision and recall:
\[\text{macro-}P=\frac{1}{n} \sum_{i=1}^n P_i ~, \quad \text{macro-}R=\frac{1}{n} \sum_{i=1}^n R_i ~, \quad \text{macro-}F_\beta = F_\beta(\text{macro-}P, \text{macro-}R) ~. \]
- micro-P, micro-R: average the counts \(TP, FP, FN\) first:
\[\text{micro-}P = \frac{\overline{TP}} {\overline{TP}+\overline{FP}} ~, \quad \text{micro-}R = \frac{\overline{TP}} {\overline{TP}+\overline{FN}} ~. \]
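A sketch contrasting the two averaging orders, given one \((TP, FP, FN)\) triple per confusion matrix (the function name is my own; division-by-zero guards are omitted):

```python
def macro_micro(confusions):
    """confusions: list of (TP, FP, FN) triples, one per confusion matrix.
    Macro averages the per-matrix P_i and R_i; micro averages the counts
    first and computes P, R once from the averaged counts."""
    n = len(confusions)
    ps = [tp / (tp + fp) for tp, fp, fn in confusions]
    rs = [tp / (tp + fn) for tp, fp, fn in confusions]
    macro_p, macro_r = sum(ps) / n, sum(rs) / n
    tp_bar = sum(c[0] for c in confusions) / n
    fp_bar = sum(c[1] for c in confusions) / n
    fn_bar = sum(c[2] for c in confusions) / n
    micro_p = tp_bar / (tp_bar + fp_bar)
    micro_r = tp_bar / (tp_bar + fn_bar)
    return macro_p, macro_r, micro_p, micro_r
```

The two generally disagree: macro weights every matrix equally, micro weights every counted case equally.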
- ROC: Receiver Operating Characteristic curve; x-axis FPR, y-axis TPR, where \(TPR=R=\frac{TP}{TP+FN}\) and
\[FPR=\frac{FP}{TN+FP} ~. \]
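A sketch of tracing the ROC curve by sweeping a decision threshold over predicted scores (the function name is my own; ties in score are handled naively):

```python
def roc_points(scores, labels):
    """Visit cases from highest score to lowest, as if lowering a decision
    threshold; each step yields one (FPR, TPR) point, where
    TPR = TP/(TP+FN) and FPR = FP/(TN+FP)."""
    pos = sum(labels)            # total positives = TP + FN
    neg = len(labels) - pos      # total negatives = TN + FP
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for s, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))
    return pts
```

A perfect ranker passes through \((0, 1)\): all positives are counted before any negative.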