linear regression model

  • multiple input dimensions: multiple linear regression
    $y_n \approx f(\mathbf{x}_n)_{\text{estimate}} := \begin{bmatrix}1 & x_{n1} & x_{n2} & \cdots & x_{nD} \end{bmatrix}\begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix} = \widetilde{\mathbf{x}}_n^T\boldsymbol{\beta}$

  • considering N input data samples:

$\mathbf{y} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \approx f(\mathbf{X})_{\text{estimate}} $

$:= \begin{bmatrix}\mathbf{1} & \mathbf{x}_{1} & \mathbf{x}_{2} & \cdots & \mathbf{x}_{D} \end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix} $

\(= \begin{bmatrix}1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix}\)

\(= \widetilde{\mathbf{X}}\boldsymbol{\beta}\)

  • consider the residual error (see the sketch below):

$\mathbf{y} = \begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$

\(= \begin{bmatrix}1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N\end{bmatrix}\)

$ = \widetilde{\mathbf{X}}\boldsymbol{\beta} + \boldsymbol{\epsilon} $
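A minimal numpy sketch of the matrix form above, assuming a small synthetic data set (the names `X_tilde`, `beta_true` and the sizes are illustrative, not anything fixed by these notes):

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3                                  # N samples, D input dimensions
X = rng.normal(size=(N, D))                    # raw inputs, one row per sample x_n
beta_true = np.array([0.5, 1.0, -2.0, 3.0])    # [beta_0, beta_1, ..., beta_D]

# augmented design matrix X~: prepend a column of ones for the intercept beta_0
X_tilde = np.hstack([np.ones((N, 1)), X])      # shape (N, D+1)

# y = X~ beta + eps, with i.i.d. Gaussian residuals eps
eps = rng.normal(loc=0.0, scale=0.1, size=N)
y = X_tilde @ beta_true + eps
```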

  • assume the error follows a Gaussian distribution, since we do not know the exact error:

\(\epsilon_n \sim \mathcal{N}(0,\sigma^2)\)

  • rewrite the linear regression model:
    for the ith data sample:
    $p(y_i|\mathbf{x}_i,\theta) = p(y_i|\mathbf{x}_i,(\mathbf{w},\sigma^2)) = \mathcal{N}(\mu(\mathbf{x}_i),\sigma^2(\mathbf{x}_i))$

$= \mathcal{N}(\mathbf{w}^T\mathbf{x}_i,\sigma^2(\mathbf{x}_i)) $

in our notation:
$p(\mathbf{y}|\widetilde{\mathbf{X}},\theta)=p(\mathbf{y}|\widetilde{\mathbf{X}},(\boldsymbol{\beta},\sigma^2))=\mathcal{N}(\widetilde{\mathbf{X}}\boldsymbol{\beta},\sigma^2\mathbf{I}) $

  • i.e.:

$p\left(\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\,\middle|\,\begin{bmatrix}1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix},\left(\begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix},\sigma^2\right)\right)$

$= \mathcal{N}\left(\begin{bmatrix}1 & x_{11} & x_{12} & \cdots & x_{1D} \\ 1 & x_{21} & x_{22} & \cdots & x_{2D} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{ND} \end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix},\ \sigma^2\mathbf{I}\right)$
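As a sanity check on the likelihood form, here is a sketch that evaluates \(\log p(\mathbf{y}|\widetilde{\mathbf{X}},(\boldsymbol{\beta},\sigma^2))\) under \(\mathcal{N}(\widetilde{\mathbf{X}}\boldsymbol{\beta},\sigma^2\mathbf{I})\); it reuses the hypothetical `X_tilde`, `y`, `beta_true` arrays from the earlier snippet:

```python
import numpy as np

def log_likelihood(beta, sigma2, X_tilde, y):
    """Sum over samples of log N(y_n | x~_n^T beta, sigma2)."""
    resid = y - X_tilde @ beta
    N = y.shape[0]
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * (resid @ resid) / sigma2

# e.g. the true coefficients should score higher than a perturbed copy:
# log_likelihood(beta_true, 0.1**2, X_tilde, y) > log_likelihood(beta_true + 1.0, 0.1**2, X_tilde, y)
```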

  • for basis function expansion \(\phi(\mathbf{x})\):
    $p(y|\mathbf{x},\theta)=p(y|\mathbf{x},(\mathbf{w},\sigma^2,\phi))= \mathcal{N}(\mathbf{w}^T \phi(\mathbf{x}),\sigma^2(\mathbf{x})) $

$= \mathcal{N}(\boldsymbol{\Phi}\boldsymbol{\beta},\sigma^2\mathbf{I}) $ in matrix form, where row \(n\) of \(\boldsymbol{\Phi}\) is \(\phi(\mathbf{x}_n)^T\)

\(\phi(x) = [1, x, x^2, \cdots, x^d]\) for a scalar input
note that \(\phi\) is applied to each sample rather than to \(\widetilde{\mathbf{X}}\) as a whole, so row \(n\) of \(\boldsymbol{\Phi}\) is \([1, x_n, x_n^2, \cdots, x_n^d]\) (see the sketch below)
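A sketch of the polynomial expansion for a scalar input, applied row by row to build the expanded design matrix (the degree `d` and function name are made up for illustration):

```python
import numpy as np

def poly_design_matrix(x, d):
    """Row n is phi(x_n)^T = [1, x_n, x_n^2, ..., x_n^d] for scalar samples x_n."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=d + 1, increasing=True)   # shape (len(x), d+1)

x = np.linspace(-1.0, 1.0, 5)
Phi = poly_design_matrix(x, d=3)   # expanded design matrix, one row per sample
```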


cost function

is there an automatic way to define a loss function?

  • Two desirable properties of cost functions:
    1. +ve and -ve errors should be penalized equally.
    2. "large" mistakes and "very large" mistakes should be penalized almost equally.

  • statistical vs. computational (convexity) tradeoff

  • cost function \(\mathcal{L}(\beta) = \mathcal{L}(\begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix}) = \mathcal{L}(\beta_0 ,\beta_1,\cdots, \beta_D)\)

    • MAE := \(\frac{1}{N}\sum_{n = 1}^{N} |y_n-f(\mathbf{x}_n)_{\text{estimate}}|\) = \(\frac{1}{N}\sum_{n = 1}^{N} |\epsilon_n| = \frac{1}{N}(|\epsilon_1| + |\epsilon_2| + \cdots + |\epsilon_N|)\)
    • MSE := \(\frac{1}{N}\sum_{n = 1}^{N} (y_n-f(\mathbf{x}_n)_{\text{estimate}})^2 = \frac{1}{N}\sum_{n = 1}^{N} \epsilon_n^2\)
    • Huber loss: quadratic for small residuals, linear for large ones (see the sketch after this list)
    • Tukey's bisquare loss: defined in terms of gradients
  • goal: find \(\beta^*\) such that \(\mathcal{L}(\beta)\) reaches its minimum; note that \(\beta \in \mathbb{R}^{D+1}\)

    • an Unconstrained Optimization Problem
      • existence of an optimal solution?
      • characterization of the optimal solution
      • algorithms for computing the optimal solution
    • methods: grid search; gradient descent; least squares

  • grid search: extremely simple
  • works for any kind of loss
  • but exponential computational complexity in the number of parameters
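A sketch of the losses listed above, written as functions of the residual vector \(\boldsymbol{\epsilon} = \mathbf{y} - \widetilde{\mathbf{X}}\boldsymbol{\beta}\); the Huber threshold `delta` is an arbitrary choice here:

```python
import numpy as np

def mae(resid):
    """Mean absolute error: +ve and -ve errors penalized equally, robust to large errors."""
    return np.mean(np.abs(resid))

def mse(resid):
    """Mean squared error: smooth and convex, but large errors dominate."""
    return np.mean(resid ** 2)

def huber(resid, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond: a compromise between MSE and MAE."""
    abs_r = np.abs(resid)
    return np.mean(np.where(abs_r <= delta,
                            0.5 * resid ** 2,
                            delta * (abs_r - 0.5 * delta)))
```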

BATCH GRADIENT DESCENT

  • one parameter (single-variable function optimization): \(\beta^{(k+1)} \gets \beta^{(k)} - \alpha \frac{\partial \mathcal{L}(\beta^{(k)})}{\partial \beta}\) (the stepsize \(\alpha\) must be chosen carefully)

  • multi-parameter (multivariable function optimization): \(\boldsymbol{\beta}^{(k+1)} \gets \boldsymbol{\beta}^{(k)} - \alpha \frac{\partial \mathcal{L}(\boldsymbol{\beta}^{(k)})}{\partial \boldsymbol{\beta}}\) (see the sketch below)
    $= \boldsymbol{\beta}^{(k)} - \alpha \nabla \mathcal{L}(\boldsymbol{\beta}^{(k)}) $

$= \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix}^{(k)} - \alpha \nabla \mathcal{L}\left(\begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix}^{(k)}\right) $

\(= \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_D\end{bmatrix}^{(k)} - \alpha \nabla \mathcal{L}(\beta_0^{(k)} ,\beta_1^{(k)},\cdots, \beta_D^{(k)})\)

\(= \begin{bmatrix}\beta_0^{(k)} \\ \beta_1^{(k)} \\ \vdots \\ \beta_D^{(k)}\end{bmatrix} - \alpha\begin{bmatrix}\frac{\partial \mathcal{L}(\beta_0^{(k)} ,\beta_1^{(k)},\cdots, \beta_D^{(k)})}{\partial \beta_0} \\ \frac{\partial \mathcal{L}(\beta_0^{(k)} ,\beta_1^{(k)},\cdots, \beta_D^{(k)})}{\partial \beta_1} \\ \frac{\partial \mathcal{L}(\beta_0^{(k)} ,\beta_1^{(k)},\cdots, \beta_D^{(k)})}{\partial \beta_2} \\ \vdots \\ \frac{\partial \mathcal{L}(\beta_0^{(k)} ,\beta_1^{(k)},\cdots, \beta_D^{(k)})}{\partial \beta_D} \end{bmatrix}\)
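A minimal batch gradient descent sketch for the MSE cost \(\mathcal{L}(\boldsymbol{\beta}) = \frac{1}{N}\|\mathbf{y} - \widetilde{\mathbf{X}}\boldsymbol{\beta}\|^2\); the step size `alpha`, iteration budget, and tolerance are assumptions, not tuned values:

```python
import numpy as np

def batch_gradient_descent(X_tilde, y, alpha=0.1, num_iters=1000, tol=1e-8):
    """Iterate beta^(k+1) = beta^(k) - alpha * grad L(beta^(k)) for the MSE cost."""
    beta = np.zeros(X_tilde.shape[1])
    N = X_tilde.shape[0]
    for _ in range(num_iters):
        grad = -2.0 / N * X_tilde.T @ (y - X_tilde @ beta)   # nabla L(beta)
        beta_next = beta - alpha * grad
        if np.linalg.norm(beta_next - beta) < tol:           # simple stopping criterion
            return beta_next
        beta = beta_next
    return beta

# e.g. beta_hat = batch_gradient_descent(X_tilde, y) should approach beta_true
```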

  • convergence of the method: analysis of gradient descent

    • when is the method guaranteed to converge? \(\mathcal{L}(\beta)\) must be continuously differentiable on the bounded level set {\(\beta \mid \mathcal{L}(\beta) < \mathcal{L}(\beta^{(0)})\)} -- refer to the convergence theorem
      • then the gradient sequence {\(\nabla \mathcal{L}(\beta^{(k)})\)} converges to \(\mathbf{0}\)
    • convergence order, linear convergence, convergence constant (see here)
    • if the problem is a quadratic optimization problem (see here), the analysis can be made concrete:
    • QP: minimize \(f(x) := \frac{1}{2} x^T Q x + c^T x\), where \(Q\) is a symmetric matrix and the level sets are ellipses; the eigenvalues of \(Q\) set the axis lengths of those ellipses, and the convergence constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix \(H(x)\) at the optimal solution \(x^*\); we want this ratio to be small, i.e. close to 1
    • start by analyzing the simplest quadratic-form problem: for \(Q = \begin{bmatrix}A & B \\ C & D\end{bmatrix}\), the Hessian of \(x^T Q x\) is \(\begin{bmatrix}2A & B+C \\ C+B & 2D\end{bmatrix}\). During gradient descent the change in \((x, y)\) is along the direction opposite to \((f_x, f_y)\), scaled by \(\alpha\). If the rates of change of the gradient in \(x\) and \(y\) (\(f_{xx}\), \(f_{yy}\)) are similar, both coordinates approach the minimum at about the same time and the convergence rate is fast. On a quadratic model this means moving along the axis directions of the level-set ellipse; if the ellipse is nearly a circle, convergence is fastest. Those axis lengths are tied to the eigenvalues of \(H\); going from \(Q\) to \(H\) is a linear transformation that leaves them unchanged.
    • the second derivative is the Hessian matrix, a beautiful symmetric matrix, so it has many nice properties: its level sets (cross-sections) are ellipses, it is positive definite at a minimum, its eigenvalues correspond to the ellipse radii, and so on
    • other cases are more tedious to analyze, so the intuition is enough: gradient descent moves along the "radius" directions associated with the eigenvalues, and the closer the largest and smallest \(\lambda\) are, the better (for a function of two variables there are only two \(\lambda\)s, so it is just their ratio; with more than two variables you can view it as several two-variable pairs, i.e. \((x,z)\), \((x,y)\), \((y,z)\), and the limiting factor is the pair with the largest ratio)
    • ratio of largest to smallest eigenvalue = condition number of the matrix (see the sketch after this list)
  • ...
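A small sketch of the condition-number intuition above: gradient descent on \(f(x) = \frac{1}{2}x^TQx\) takes far more iterations when the ratio of the largest to smallest eigenvalue of \(Q\) is large (the matrices and step sizes below are made-up examples, with \(\alpha\) chosen just under \(2/\lambda_{max}\)):

```python
import numpy as np

def gd_iterations(Q, alpha, tol=1e-8, max_iters=100000):
    """Count gradient descent iterations on f(x) = 0.5 x^T Q x, starting from x = (1, 1)."""
    x = np.ones(Q.shape[0])
    for k in range(max_iters):
        grad = Q @ x
        if np.linalg.norm(grad) < tol:
            return k
        x = x - alpha * grad
    return max_iters

Q_round = np.diag([1.0, 1.1])        # condition number ~1.1: level sets nearly circular
Q_stretched = np.diag([1.0, 100.0])  # condition number 100: long thin ellipses

print(gd_iterations(Q_round, alpha=0.9))        # converges in a handful of steps
print(gd_iterations(Q_stretched, alpha=0.019))  # needs on the order of a thousand steps
```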

  • Stopping criteria -- optimality conditions (essentially the first and second derivatives) -- design of the gradient descent method

  1. first derivative of \(\mathcal{L}(\beta)\) is zero: \(\frac{\partial \mathcal{L}(\beta^{(k)})}{\partial \beta}= 0\), or \(\nabla \mathcal{L}(\boldsymbol{\beta}^{(k)}) = \begin{bmatrix}\frac{\partial \mathcal{L}(\boldsymbol{\beta}^{(k)})}{\partial \beta_0} \\ \frac{\partial \mathcal{L}(\boldsymbol{\beta}^{(k)})}{\partial \beta_1} \\ \frac{\partial \mathcal{L}(\boldsymbol{\beta}^{(k)})}{\partial \beta_2} \\ \vdots \\ \frac{\partial \mathcal{L}(\boldsymbol{\beta}^{(k)})}{\partial \beta_D} \end{bmatrix} = \begin{bmatrix}0 \\0 \\ \vdots \\0 \end{bmatrix}\)
  2. second derivative > 0, i.e. the Hessian is positive definite
  • Optimality conditions are still useful, in that they serve as a stopping
    criterion when they are satisfied to within a predetermined error tolerance

  • tradeoff: faster convergence ⇔ higher computational cost per iteration

  • step-size selection: \(\alpha\)

    • requirement: convergence to a local minimum is guaranteed only when \(\alpha\) < \(\alpha_{min}\), where \(\alpha_{min}\) is a fixed constant that depends on the problem.
  • line-search methods: used to set the step-size automatically, e.g. backtracking


- summary of the notes:
  • the way to choose step size:
    1. exact line search: \(\alpha^k\) is picked to solve \(\min_{\alpha} f(x^k + \alpha p^k)\) => usable, but not cost-effective
    2. inexact line search:
       1. backtracking (Armijo) line search: 1) \(\alpha^{(0)} = \alpha_{init}\) is given, \(l=0\); 2) until \(f(x^k + \alpha^{(l)}p^k) "<" f^k\), i) set \(\alpha^{(l+1)} = t\alpha^{(l)}\) with \(t \in (0,1)\) fixed, and ii) increment \(l\) by 1; 3) set \(\alpha^k = \alpha^{(l)}\) -- this prevents the step size from getting too small before approaching the solution, but it does not prevent the step from being too long -- improved by tightening the "<" requirement; see the notes for the modification and termination, and the sketch after this list
  • besides choosing the step size, we can also choose the direction: originally we use the steepest-descent method, which follows the negative gradient direction, but it usually has a low convergence order and may fail to converge numerically; other directions can be used -- more general descent methods
  • say the direction is \(p^k\) and the gradient is \(\nabla f(\boldsymbol{\beta})\); the requirement on \(p^k\) is \(\nabla f(\boldsymbol{\beta})^T p^k < 0\) (pointing in roughly the opposite direction to the gradient)
  • variable metric method

  • "bisenction algorithm for a line-search of a convex function" seek to solve:
    • \(\overline{\alpha} := arg min_{\alpha} f(\overline{x} + \alpha \overline{d})\)
    • x_bar : current iterate, d_bar: curretn direction generate by an algorithm that seeks to minimize f(x) such as a descent diection of f(x) at x = x_bar
    • => let \(h(\alpha) = f(\overline{x} + \alpha \overline{d})\)
    • goal: find \(\alpha_0\) such that \(h(\alpha)\) reach minimum
    • first derivative: \(h'(\alpha) = 0\)
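A sketch of the backtracking rule from the summary above, using the standard Armijo sufficient-decrease test as the tightened "<" condition (the constants `c` and `t` are conventional but arbitrary choices):

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, p, alpha_init=1.0, t=0.5, c=1e-4):
    """Shrink alpha until f(x + alpha p) <= f(x) + c * alpha * grad_f(x)^T p."""
    alpha = alpha_init
    fx = f(x)
    slope = grad_f(x) @ p            # must be negative: p is a descent direction
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= t                   # multiply by the fixed t in (0, 1)
    return alpha

# usage on a simple quadratic, along the steepest-descent direction p = -grad f(x)
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x0 = np.array([3.0, -4.0])
alpha_k = backtracking_line_search(f, grad_f, x0, p=-grad_f(x0))
```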

convexity

Outliers - Robust statistics:

  • outliers may bias the previous summary statistics; this can be solved by eliminating or downweighting the outlier values in the sample (quality control) or by using statistics that are resistant to the presence of outliers
  • resistant != robust
  • "robust" is used in statistics to refer to insensitivity to the choice of probability model or estimator, rather than to data values.

least squares

  • least squares estimates for regression models are highly sensitive to (not robust against) outliers.
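A small sketch of that sensitivity on a made-up 1-D data set: corrupting a single observation noticeably shifts the least-squares coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # true line: intercept 1, slope 2

X_tilde = np.column_stack([np.ones_like(x), x])
beta_clean, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

y_outlier = y.copy()
y_outlier[-1] += 100.0                                    # corrupt one observation
beta_dirty, *_ = np.linalg.lstsq(X_tilde, y_outlier, rcond=None)

print(beta_clean)   # close to [1, 2]
print(beta_dirty)   # pulled noticeably toward the outlier
```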

This week I studied machine learning until I could barely see straight, sleeping only four and a half hours.
Unbelievable. I used to think that finishing linear algebra would be enough, but it turns out machine learning = linear algebra + numerical analysis + optimization + statistics + programming + things I don't even know about yet.
I'm speechless.
I've been using cmd all along, but it lags badly once the text gets long. When I have time I want to build my own website that supports markdown and inserting images both offline and online!