[Andrew Ng] Regularization

Chapter 7 Regularization

The problem of overfitting

If we have too many features, the learned hypothesis may fit the training set very well, but fail to generalize to new examples.

Addressing overfitting

  1. Reduce the number of features
  • Manually select which features to keep.
  • Use a model selection algorithm.
  2. Regularization
  • Keep all the features, but reduce the magnitude/values of the parameters \(\theta_j\), which gives a simpler hypothesis and smoother functions.

Cost function

\[J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2\right] \]

\(\lambda\): the regularization parameter. It controls a trade-off between two goals: fitting the training set well, and keeping the parameters small to avoid overfitting.

We don't need to shrink \(\theta_0\), because \(\theta_0\) corresponds to the constant term, which has little influence on overfitting.
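
A minimal NumPy sketch of this regularized cost (the names `regularized_cost`, `X`, `y`, `theta`, `lam` are my own choices, not from the lecture):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) for linear regression with L2 regularization.

    X is the (m, n+1) design matrix with a leading column of ones (x_0 = 1),
    y the (m,) targets, theta the (n+1,) parameters, lam the regularization parameter.
    """
    m = len(y)
    residual = X @ theta - y                 # h_theta(x^(i)) - y^(i) for every example
    fit_term = residual @ residual           # sum of squared errors
    reg_term = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (fit_term + reg_term) / (2 * m)
```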

Linear regression

Gradient descent

repeat until convergence{

\[\begin{aligned} \theta_0&=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta)\\ &=\theta_0-\frac{\alpha}{m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_0^{(i)}\\ \theta_j&=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &=\theta_j-\alpha\left[\frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\\ &=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{aligned} \]

( \(j=1,\cdots,n\))

}

\(1-\alpha\frac{\lambda}{m}\) is less than \(1\) but very close to \(1\), because \(\alpha\) is small and \(m\) is large. Multiplying \(\theta_j\) by this term shrinks it slightly on every iteration.
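
The update rule above translates directly to NumPy. The sketch below (continuing the names from the earlier cost sketch; the fixed iteration count is my own assumption) applies the regularization term only to \(\theta_1,\dots,\theta_n\):

```python
def gradient_descent(X, y, theta, alpha, lam, num_iters=1000):
    """Batch gradient descent for regularized linear regression."""
    m = len(y)
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / m   # (1/m) * sum (h_theta(x) - y) * x_j, for all j
        grad[1:] += (lam / m) * theta[1:]  # add (lambda/m) * theta_j for j >= 1 only
        theta = theta - alpha * grad       # simultaneous update of every theta_j
    return theta
```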

Normal equation

\[\theta=(X^TX+\lambda \left[ \begin{matrix} 0&&&&\\ &1&&&\\ &&1&&\\ &&&\ddots&\\ &&&&1 \end{matrix} \right]_{(n+1)\times(n+1)} )^{-1}X^Ty \]

If \(\lambda>0\), it can be proved that this matrix is always invertible, so regularization also takes care of the case where \(X^TX\) itself is non-invertible.
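
A sketch of the regularized normal equation in NumPy (names are illustrative; it uses `np.linalg.solve` rather than forming an explicit inverse):

```python
def normal_equation(X, y, lam):
    """Closed-form solution for regularized linear regression."""
    L = np.eye(X.shape[1])  # (n+1) x (n+1) identity ...
    L[0, 0] = 0             # ... with a 0 in the top-left so theta_0 is not regularized
    # Solve (X^T X + lambda * L) theta = X^T y
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```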

Logistic regression

\[\begin{aligned} J(\theta)=-\frac{1}{m}\sum_{i=1}^m[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2 \end{aligned} \]
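A sketch of this regularized logistic cost in NumPy (my own names again; `sigmoid` is the standard logistic function used as the hypothesis):

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Cross-entropy cost with an L2 penalty on theta_1..theta_n."""
    m = len(y)
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every example
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    return cross_entropy + lam * np.sum(theta[1:] ** 2) / (2 * m)
```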

Gradient descent

repeat until convergence{

\[\begin{aligned} \theta_0&=\theta_0-\alpha\frac{\partial}{\partial\theta_0}J(\theta)\\ &=\theta_0-\frac{\alpha}{m} \sum_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_0^{(i)}\\ \theta_j&=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ &=\theta_j-\alpha\left[\frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\\ &=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\frac{\alpha}{m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)} \end{aligned} \]

( \(j=1,\cdots,n\))

}
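
The update has the same form as for linear regression; only the hypothesis \(h_\theta\) changes. A sketch reusing the `sigmoid` helper from the previous block:

```python
def logistic_gradient_descent(X, y, theta, alpha, lam, num_iters=1000):
    """Batch gradient descent for regularized logistic regression."""
    m = len(y)
    for _ in range(num_iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m  # only h_theta differs from linear regression
        grad[1:] += (lam / m) * theta[1:]          # regularize theta_1..theta_n only
        theta = theta - alpha * grad
    return theta
```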

Words and expressions

ameliorate: to improve

wiggly: curvy, not smooth
