Probabilistic interpretation

1. Guide

    When faced with a regression problem, why might linear regression, and specifically why might the least-squares cost function J, be a reasonable choice? In this section, we will give a set of probabilistic assumptions, under which least-squares regression is derived as a very natural algorithm.

 

2. Let us assume that the target variables and the inputs are related via the equation:

                               y(i) = θT x(i) + ε(i),

    where ε(i) is an error term that captures either unmodeled effects (such as features very pertinent to predicting housing price that we left out of the regression) or random noise.

    We further assume that the ε(i) are distributed IID (independently and identically distributed) according to a Gaussian distribution with mean zero and variance σ2, i.e. ε(i) ∼ N(0, σ2). The density of ε(i) is therefore given by

                        p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)
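    As a concrete illustration of these assumptions, here is a minimal sketch that generates data from the model above with Gaussian noise. The dimensions, the "true" θ, and σ are made-up values for the example, not anything from the text.

        import numpy as np

        # Generate data satisfying y^(i) = theta^T x^(i) + epsilon^(i),
        # with epsilon^(i) ~ N(0, sigma^2) drawn IID.
        rng = np.random.default_rng(0)
        n, d = 100, 2                        # number of examples / features (assumed)
        theta_true = np.array([1.5, -0.7])   # hypothetical "true" parameters
        sigma = 0.5                          # hypothetical noise standard deviation

        X = rng.normal(size=(n, d))          # design matrix, one x^(i) per row
        eps = rng.normal(0.0, sigma, size=n) # epsilon^(i), IID Gaussian noise
        y = X @ theta_true + eps             # the observed targets y^(i)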

    Since ε(i) = y(i) - θT x(i), this implies that the density of y(i) given x(i) is

              p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)

    The notation “p(y(i)|x(i); θ)” indicates that this is the distribution of y(i) given x(i) and parameterized by θ. Note that we should not condition on θ (“p(y(i)|x(i), θ)”), since θ is not a random variable. We can also write the distribution of y(i) as y(i) | x(i); θ ∼ N(θT x(i), σ2).
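    Concretely, p(y(i)|x(i); θ) is just a Gaussian density evaluated at y(i), with mean θT x(i) and standard deviation σ. A minimal sketch, assuming scipy is available (the function name is ours, not from the text):

        from scipy.stats import norm

        def p_y_given_x(y_i, x_i, theta, sigma):
            # Gaussian density with mean theta^T x^(i) and standard deviation sigma
            return norm.pdf(y_i, loc=x_i @ theta, scale=sigma)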

    Now consider the design matrix X (which contains all the x(i) as rows) and the vector ~y (which contains all the y(i)). Viewed as a function of θ for this fixed data, the quantity p(~y | X; θ) is called the likelihood function:

                                      L(\theta) = L(\theta; X, \vec{y}\,) = p(\vec{y} \mid X; \theta)

    Notice that since the ε(i) are independent, the y(i) are independent given the x(i), with y(i) | x(i); θ ∼ N(θT x(i), σ2). Writing n for the number of training examples, the likelihood therefore factors as a product:

                                L(\theta) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
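    In code, L(θ) is literally this product of per-example densities. A sketch reusing the hypothetical p_y_given_x, X, y, and sigma from above (multiplying many densities underflows for large n, which is one practical reason to take the log below):

        import numpy as np

        def likelihood(theta, X, y, sigma):
            # L(theta) = product over i of p(y^(i) | x^(i); theta)
            return np.prod([p_y_given_x(y_i, x_i, theta, sigma)
                            for x_i, y_i in zip(X, y)])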

    The principle of maximum likelihood says that we should choose θ so as to make the data as probable as possible. I.e., we should choose θ to maximize L(θ).

    Instead of maximizing L(θ) itself, we can equivalently maximize any strictly increasing function of it. In particular, the derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ):

                             \ell(\theta) = \log L(\theta)
                                          = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
                                          = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
                                          = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
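    The same quantity in code, computed as a sum of log-densities rather than a product (again only a sketch on the hypothetical variables above):

        import numpy as np

        def log_likelihood(theta, X, y, sigma):
            # l(theta) = n*log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum_i (y^(i) - theta^T x^(i))^2
            n = len(y)
            resid = y - X @ theta
            return (n * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
                    - 0.5 * np.sum(resid ** 2) / sigma ** 2)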

    Hence, maximizing ℓ(θ) gives the same answer as minimizing

              \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,

    which we recognize to be J(θ), our original least-squares cost function.
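    Reading off the last line of ℓ(θ): the first term does not involve θ, so maximizing ℓ(θ) over θ is exactly minimizing J(θ). A small sketch of this correspondence on the hypothetical data above; the closed-form minimizer used here is the standard normal-equations solution:

        import numpy as np

        def J(theta, X, y):
            # Least-squares cost: (1/2) * sum_i (y^(i) - theta^T x^(i))^2
            resid = y - X @ theta
            return 0.5 * np.sum(resid ** 2)

        # l(theta) = n*log(1/(sqrt(2*pi)*sigma)) - J(theta)/sigma^2, so any theta
        # that increases l must decrease J. The minimizer of J (equivalently the
        # maximizer of l) is given by the normal equations theta = (X^T X)^{-1} X^T y:
        theta_hat = np.linalg.solve(X.T @ X, X.T @ y)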

 

3. Summary

   Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ. This is thus one set of assumptions under which least-squares regression can be justified as a very natural method that’s just doing maximum likelihood estimation. (Note however that the probabilistic assumptions are by no means necessary for least-squares to be a perfectly good and rational procedure, and there may—and indeed there are—other natural assumptions that can also be used to justify it.)

    Note also that, in our previous discussion, our final choice of θ did not depend on σ2, and indeed we’d have arrived at the same result even if σ2 were unknown. We will use this fact again later, when we talk about the exponential family and generalized linear models.
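    To see this σ2-independence concretely in the sketch above: σ enters ℓ(θ) only through an additive constant and the positive factor 1/σ2 multiplying J(θ), neither of which changes which θ is best. A quick hypothetical check, reusing log_likelihood, theta_hat, X, and y from the earlier sketches:

        # Perturbing theta_hat lowers the log likelihood for any choice of sigma,
        # because l(theta) = const(sigma) - J(theta)/sigma^2 and J is minimized at theta_hat.
        for sigma_test in (0.5, 2.0):            # two arbitrary noise levels
            best = log_likelihood(theta_hat, X, y, sigma_test)
            worse = log_likelihood(theta_hat + 0.1, X, y, sigma_test)
            assert worse <= best                 # theta_hat is still the maximizer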

 

    

    

posted on 2013-04-13 09:44  BigPalm