Error Bound Analysis in Machine Learning
Error Theorem in ML
Basic Concepts
\(\varepsilon(h)\) is the generalization (true) error of a hypothesis \(h\); \(\hat{\varepsilon}(h)\) is the empirical (training) error, estimated from the finite training data.
The generalization error of the learned hypothesis can be decomposed as

\[
\varepsilon(\hat{h}) \;=\; \underbrace{\varepsilon(g)}_{\text{irreducible error}} \;+\; \underbrace{\varepsilon(h^*) - \varepsilon(g)}_{\text{approximation error}} \;+\; \underbrace{\varepsilon(\hat{h}) - \varepsilon(h^*)}_{\text{estimation error}},
\]

where \(g\) is the best possible hypothesis (over all functions), \(h^*\) is the best hypothesis in your class \(\mathcal{H}\), and \(\hat{h}\) is the hypothesis learned from your limited training data. The estimation error can then be broken down further into estimation variance and estimation bias.
Empirical Risk Minimization
Empirical risk minimization (ERM) is the learning algorithm that picks the hypothesis with the smallest training error: \(\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\varepsilon}(h)\).
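A minimal sketch of ERM over a small finite hypothesis class. The threshold class, toy data, and function names here are hypothetical choices just to make the \(\arg\min\) concrete:

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical risk: fraction of training points that h misclassifies."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """ERM: return the hypothesis with the smallest training error."""
    return min(hypotheses, key=lambda h: empirical_error(h, X, y))

# Hypothetical finite class: 1-D threshold classifiers h_t(x) = 1{x >= t}.
thresholds = np.linspace(-1.0, 1.0, 21)
hypotheses = [lambda X, t=t: (X >= t).astype(int) for t in thresholds]

# Toy training set: labels follow a threshold at 0, with ~10% label noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
y = ((X >= 0.0) ^ (rng.random(50) < 0.1)).astype(int)

h_hat = erm(hypotheses, X, y)
print("training error of h_hat:", empirical_error(h_hat, X, y))
```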
Uniform Convergence

To prove a generalization bound, think about these two relations:
- \(\hat{\varepsilon}(h)\) and \(\varepsilon(h)\): for a hypothesis \(h\), does the training error track the generalization error?
- \(\varepsilon(\hat{h})\) and \(\varepsilon(h^*)\): how much worse is the learned hypothesis than the best one in the class?
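Uniform convergence is the statement that the first relation holds for every hypothesis in the class simultaneously; a sketch of the standard event, with \(\gamma\) the accuracy parameter that is made explicit in the finite case below:

\[
P\Big( |\hat{\varepsilon}(h) - \varepsilon(h)| \le \gamma \;\; \text{for all } h \in \mathcal{H} \Big) \;\ge\; 1 - \delta .
\]

Once this event holds, the second relation follows from it.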
The proof uses two lemmas:
- Union bound: \(P(A_1 \cup \cdots \cup A_k) \le P(A_1) + \cdots + P(A_k)\)
- Hoeffding inequality: if \(\hat{\phi}\) is the mean of \(m\) i.i.d. Bernoulli(\(\phi\)) random variables, then \(P(|\phi - \hat{\phi}| > \gamma) \le 2 \exp(-2\gamma^2 m)\)
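A quick empirical sanity check of the Hoeffding bound for Bernoulli samples; the values of \(\phi\), \(m\), and \(\gamma\) below are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical settings: true mean phi, sample size m, accuracy gamma.
phi, m, gamma = 0.3, 200, 0.05
trials = 100_000

rng = np.random.default_rng(42)
# Draw `trials` independent training sets of m Bernoulli(phi) variables each.
samples = rng.random((trials, m)) < phi
phi_hat = samples.mean(axis=1)          # empirical mean of each training set

empirical_prob = np.mean(np.abs(phi_hat - phi) > gamma)
hoeffding_bound = 2 * np.exp(-2 * gamma**2 * m)

print(f"P(|phi_hat - phi| > gamma) ~ {empirical_prob:.4f}")
print(f"Hoeffding bound            = {hoeffding_bound:.4f}")
```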
Bound Analysis for Finite Hypothesis Space
Let \(|\mathcal{H}| = k\), and fix any \(m\) (the number of samples in the training set) and \(\delta\) (the failure probability we are willing to accept). Then with probability at least \(1-\delta\), we have that

\[
\varepsilon(\hat{h}) \;\le\; \Big(\min_{h \in \mathcal{H}} \varepsilon(h)\Big) + 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}} \;=\; \varepsilon(h^*) + 2\gamma,
\]

where \(\gamma\) denotes the square-root term, \(\gamma = \sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}\).
In other words, the generalization error of the \(\hat{h}\) we get from the training set is at most \(2\gamma\) worse than that of the best hypothesis \(h^*\) in the class.
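The \(2\gamma\) comes from applying the uniform convergence event twice, once to \(\hat{h}\) and once to \(h^*\):

\[
\varepsilon(\hat{h}) \;\le\; \hat{\varepsilon}(\hat{h}) + \gamma \;\le\; \hat{\varepsilon}(h^*) + \gamma \;\le\; \varepsilon(h^*) + 2\gamma ,
\]

where the middle step uses the fact that \(\hat{h}\) minimizes the training error.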
If the hypothesis space \(\mathcal{H}\) grows, the first term \(\varepsilon(h^*)\) decreases (bias), while the second term \(2\gamma\) increases (variance), since \(k\) appears inside the logarithm.
Equivalently, to guarantee \(\varepsilon(\hat{h}) \le \varepsilon(h^*) + 2\gamma\) with probability at least \(1-\delta\), it suffices to have \(m \ge \frac{1}{2\gamma^2}\log\frac{2k}{\delta}\): the required number of training examples grows only logarithmically with the size \(k\) of the hypothesis space.
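A small sketch that makes the trade-off concrete: compute \(\gamma(m, k, \delta)\) and the sample size needed for a target gap. The function names `gamma_bound` and `sample_complexity` are my own; the formulas are the square-root term and its rearrangement from the bound above:

```python
import math

def gamma_bound(m, k, delta):
    """Uniform-convergence accuracy for a finite class of size k:
    gamma = sqrt( (1 / (2m)) * log(2k / delta) )."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))

def sample_complexity(gamma, k, delta):
    """Smallest m such that gamma_bound(m, k, delta) <= gamma:
    m >= (1 / (2 gamma^2)) * log(2k / delta)."""
    return math.ceil(math.log(2 * k / delta) / (2 * gamma**2))

# k only enters through log(2k / delta), so the required m
# grows logarithmically with the size of the hypothesis space.
for k in (10, 100, 1_000, 1_000_000):
    print(k, sample_complexity(gamma=0.05, k=k, delta=0.05))
```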
Bound Analysis for Infinite Hypothesis Space
Vapnik-Chervonenkis dimension
\(VC(\mathcal{H})\) is the size of the largest set that can be shattered by \(\mathcal{H}\); we write \(d = VC(\mathcal{H})\).
"shatter" means for labels(like the answers for X in \(\mathcal{X}\)), there exists some \(h \in \mathcal{H}\) can satisfy.
With probability at least \(1-\delta\), for all \(h \in \mathcal{H}\), \(|\varepsilon(h) - \hat{\varepsilon}(h)| \le O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right)\), so \(\varepsilon(\hat{h}) \le \varepsilon(h^*) + 2\gamma\) again holds. Consequently, the number of training examples \(m\) needed for a fixed gap grows linearly with the size of \(\mathcal{H}\) as parameterized by its VC dimension \(d\).