Feature Selection and Regularization in Learning Theory
Feature Selection
If we have n features, the VC dimension of the hypothesis class would be O(n). If n >> m (the number of training examples), over-fitting would be a potential problem.
Wrapper feature selection (roughly \(O(n^2)\) calls to the learning algorithm)
Given n features, there are \(2^n\) possible subsets. A heuristic search procedure called forward search is used to find a good feature subset:
- Initialize F = ∅.
- Repeat {
  (a) For i = 1, ..., n, if i \(\notin\) F, let \(F_i\) = F ∪ {i}, and use some version of cross validation to evaluate features \(F_i\). (I.e., train your learning algorithm using only the features in \(F_i\), and estimate its generalization error.)
  (b) Set F to be the best feature subset found on step (a).
  }
- Select and output the best feature subset that was evaluated during the entire search procedure.
Similarly, backward search starts from F = {1, ..., n} and removes features one at a time until F = ∅. A sketch of forward search in code is given below.
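A minimal sketch of forward search, assuming a scikit-learn-style estimator and `cross_val_score` for the cross-validation step; the function and variable names here are illustrative, not part of the original notes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_search(X, y, estimator=None, cv=5):
    """Greedy forward search: repeatedly add the single feature that gives
    the best cross-validated score, and return the best subset seen during
    the entire search."""
    if estimator is None:
        estimator = LogisticRegression(max_iter=1000)  # illustrative default
    n = X.shape[1]
    F = []                                  # current feature subset
    best_subset, best_score = [], -np.inf
    while len(F) < n:
        scores = {}
        for i in range(n):
            if i in F:
                continue
            Fi = F + [i]
            # Evaluate F ∪ {i}: train on only those columns, estimate generalization error.
            scores[i] = cross_val_score(estimator, X[:, Fi], y, cv=cv).mean()
        best_i = max(scores, key=scores.get)
        F = F + [best_i]
        if scores[best_i] > best_score:
            best_subset, best_score = list(F), scores[best_i]
    return best_subset, best_score
```

Each outer iteration evaluates up to n candidate subsets, and there are up to n iterations, which is where the \(O(n^2)\) count of training runs comes from.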
Filter feature selection
Filter feature selection is computationally cheaper than wrapper feature selection.
The idea here is to compute some simple score S(i) that measures how informative each feature \(x_i\) is about the class labels y. Then, we simply pick the k features with the largest scores S(i).
It is common to choose S(i) to be the mutual information between \(x_i\) and y: \(MI(x_i, y) = \sum_{x_i}\sum_{y} p(x_i, y)\log\frac{p(x_i, y)}{p(x_i)p(y)}\), where the probabilities can be estimated from their empirical distributions on the training set.
It can also be expressed as a Kullback-Leibler (KL) divergence: \(MI(x_i, y) = KL\big(p(x_i, y)\,\|\,p(x_i)p(y)\big)\).
This gives a measure of how different the probability distributions \(p(x_i, y)\) and \(p(x_i)p(y)\) are.
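A minimal sketch of the filter approach for discrete (integer-coded) features, estimating the mutual information from empirical counts and keeping the top k features; the function names and the top-k helper are illustrative:

```python
import numpy as np

def mutual_information(xi, y):
    """Estimate MI(x_i, y) from the empirical joint distribution of a
    discrete feature column xi and the labels y."""
    mi = 0.0
    for a in np.unique(xi):
        for b in np.unique(y):
            p_ab = np.mean((xi == a) & (y == b))
            p_a, p_b = np.mean(xi == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def filter_select(X, y, k):
    """Score every feature with S(i) = MI(x_i, y) and return the top-k indices."""
    scores = np.array([mutual_information(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```

Unlike the wrapper approach, this never retrains the learning algorithm, which is why it is much cheaper.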
Regularization
Frequentist School and Bayesian School
For the parameter \(\theta\), the frequentist school regards it as an unknown constant for which we need an estimate, while the Bayesian school regards it as an unknown random variable.
Maximum likelihood
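The frequentist estimate is maximum likelihood; writing the training set as \(\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}\) (a notational assumption), it is

\[
\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big).
\]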
MAP (maximum a posteriori)
Commonly, one chooses the prior \(\theta \sim \mathcal{N}(0,\tau^2 I)\).
The prior term \(p(\theta)\) makes the resulting estimate have a smaller norm, which reduces over-fitting; the derivation below makes this explicit.
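A short derivation under the Gaussian prior above (again using \(\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}\) for the training set as a notational assumption):

\[
\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\, p(\theta)
= \arg\max_{\theta}\left[\sum_{i=1}^{m}\log p\big(y^{(i)} \mid x^{(i)}, \theta\big) - \frac{\|\theta\|_2^2}{2\tau^2}\right],
\]

since \(\log p(\theta) = -\frac{\|\theta\|_2^2}{2\tau^2} + \text{const}\) for \(\theta \sim \mathcal{N}(0,\tau^2 I)\). Compared with maximum likelihood, the extra \(-\|\theta\|_2^2/(2\tau^2)\) term acts as an \(\ell_2\) penalty that pushes the estimate toward a smaller norm.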