Feature Selection and Regularization in Learning Theory
Feature Selection
If we have n features, the VC dimension of the hypothesis class would be O(n). If n >> m (the number of training examples), over-fitting would be a potential problem.
Wrapper feature selection (roughly \(O(n^2)\) calls to the learning algorithm)
Given n features, there are \(2^n\) possible subsets. A heuristic search procedure called forward search is used to find a good feature subset:
- Initialize F = ∅.
- Repeat {
  (a) For i = 1, ..., n, if i \(\notin\) F, let \(F_i\) = F ∪ {i}, and use some version of cross validation to evaluate features \(F_i\). (I.e., train your learning algorithm using only the features in \(F_i\), and estimate its generalization error.)
  (b) Set F to be the best feature subset found on step (a).
  }
- Select and output the best feature subset that was evaluated during the entire search procedure.
Similarly, backward search starts from F = {1, ..., n} and removes features one at a time until F = ∅. A sketch of forward search in code is given below.
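A minimal sketch of forward search, assuming a scikit-learn-style estimator and `cross_val_score` for the cross-validation step; the function and variable names here are illustrative, not part of the original notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_search(X, y, estimator=None, cv=5):
    """Greedy forward search: repeatedly add the single feature that gives
    the best cross-validated score, and return the best subset seen during
    the entire search."""
    if estimator is None:
        estimator = LogisticRegression(max_iter=1000)  # illustrative default
    n = X.shape[1]
    F = []                                  # current feature subset
    best_subset, best_score = [], -np.inf
    while len(F) < n:
        scores = {}
        for i in range(n):
            if i in F:
                continue
            Fi = F + [i]
            # Evaluate F ∪ {i}: train on only those columns, estimate generalization error.
            scores[i] = cross_val_score(estimator, X[:, Fi], y, cv=cv).mean()
        best_i = max(scores, key=scores.get)
        F = F + [best_i]
        if scores[best_i] > best_score:
            best_subset, best_score = list(F), scores[best_i]
    return best_subset, best_score
```

Each outer iteration evaluates up to n candidate subsets, and there are up to n iterations, which is where the \(O(n^2)\) count of training runs comes from.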
Filter feature selection
Filter feature selection is computationally cheaper than wrapper feature selection.
The idea here is to compute some simple score S(i) that measures how informative each feature \(x_i\) is about the class labels y. Then, we simply pick the k features with the largest scores S(i).
It is common to choose S(i) to be the mutual information between \(x_i\) and y: \(MI(x_i, y) = \sum_{x_i}\sum_{y} p(x_i, y)\log\frac{p(x_i, y)}{p(x_i)p(y)}\), where the probabilities can be estimated from their empirical distributions on the training set.
It can also be expressed as a Kullback-Leibler (KL) divergence: \(MI(x_i, y) = KL\big(p(x_i, y)\,\|\,p(x_i)p(y)\big)\).
This gives a measure of how different the probability distributions \(p(x_i, y)\) and \(p(x_i)p(y)\) are.
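A minimal sketch of the filter approach for discrete (integer-coded) features, estimating the mutual information from empirical counts and keeping the top k features; the function names and the top-k helper are illustrative:

```python
import numpy as np

def mutual_information(xi, y):
    """Estimate MI(x_i, y) from the empirical joint distribution of a
    discrete feature column xi and the labels y."""
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(y):
            p_ab = np.mean((xi == a) & (y == b))
            p_a, p_b = np.mean(xi == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def filter_select(X, y, k):
    """Score every feature with S(i) = MI(x_i, y) and return the top-k indices."""
    scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```

Unlike the wrapper approach, this never retrains the learning algorithm, which is why it is much cheaper.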
Regularization
Frequentist School and Bayesian School
For the parameter \(\theta\), the frequentist school regards it as an unknown constant for which we need an estimate, while the Bayesian school regards it as an unknown random variable.
Maximum likelihood
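The frequentist estimate is maximum likelihood; writing the training set as \(\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}\) (a notational assumption), it is

\[
\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big).
\]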
MAP (maximum a posteriori)
Commonly, one chooses the prior \(\theta \sim \mathcal{N}(0,\tau^2 I)\).
The prior term \(p(\theta)\) makes the resulting estimate have a smaller norm, which reduces over-fitting; the derivation below makes this explicit.
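A short derivation under the Gaussian prior above (again using \(\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}\) for the training set as a notational assumption):

\[
\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\, p(\theta)
= \arg\max_{\theta}\left[\sum_{i=1}^{m}\log p\big(y^{(i)} \mid x^{(i)}, \theta\big) - \frac{\|\theta\|_2^2}{2\tau^2}\right],
\]

since \(\log p(\theta) = -\frac{\|\theta\|_2^2}{2\tau^2} + \text{const}\) for \(\theta \sim \mathcal{N}(0,\tau^2 I)\). Compared with maximum likelihood, the extra \(-\|\theta\|_2^2/(2\tau^2)\) term acts as an \(\ell_2\) penalty that pushes the estimate toward a smaller norm.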