Linear Regression

1. Guide

    [Training set table: for each house i, the inputs x(i) = (living area, number of bedrooms) and the target value y(i) (the price).]

    Here, the x’s are two-dimensional vectors in R².

    For instance, x1(i) is the living area of the i-th house in the training set, and x2(i) is its number of bedrooms.

    To perform supervised learning, we must decide how to represent the hypothesis h. As an initial choice, we approximate y as a linear function of x:

                                           hθ(x) = θ0 + θ1x1 + θ2x2

    

    Here, the θi’s are the parameters (also called weights) parameterizing the space of linear functions mapping from X to Y.

    To simplify notation, we will write hθ(x) as h(x), and adopt the convention of letting x0 = 1 (the intercept term), so that

                                        h(x) = θ0x0 + θ1x1 + … + θnxn = Σ (i = 0 to n) θi xi = θᵀx

    Here, θ and x are both (n + 1)-dimensional vectors, where n is the number of input variables (not counting x0).

    How do we pick, or learn, the parameters θ?

    ---Make h(x) close to y, at least for the training examples we have.

    We define the cost function to formalize this:

                                        J(θ) = (1/2) Σ (i = 1 to m) (hθ(x(i)) − y(i))²

    ---this function is also called the least-squares cost function J.
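
    As a rough illustration, here is a minimal NumPy sketch of the hypothesis h(x) = θᵀx (with x0 = 1 prepended to every example) and of the least-squares cost J(θ). The function names, variable names, and toy numbers are my own choices for illustration, not part of the original notes.

        import numpy as np

        def add_intercept(X):
            """Prepend x0 = 1 to every example so that h(x) = theta^T x."""
            return np.hstack([np.ones((X.shape[0], 1)), X])

        def h(theta, X):
            """Hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
            return X @ theta

        def cost(theta, X, y):
            """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
            residuals = h(theta, X) - y
            return 0.5 * np.sum(residuals ** 2)

        # Toy data: (living area, bedrooms) -> target y, purely for illustration.
        X = add_intercept(np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 3.0]]))
        y = np.array([400.0, 330.0, 369.0])
        theta = np.zeros(X.shape[1])
        print(cost(theta, X, y))   # J at theta = 0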

 

2. LMS algorithm

    We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some “initial guess” for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, let's consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update:

                                        θj := θj − α ∂J(θ)/∂θj

    Here, j = 0, 1, ..., n, and α is called the learning rate. The update is performed simultaneously for every j: at each step we use the current parameters θ(k) together with the training examples x(i), y(i) (i = 1, 2, ..., m) to compute the next parameters θ(k+1).
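
    To make the generic update θj := θj − α·∂J(θ)/∂θj concrete before deriving the analytic gradient, here is a rough sketch of a single step in which the partial derivatives are approximated numerically by central finite differences. The helper names and numbers below are assumptions of mine, not from the notes.

        import numpy as np

        def cost(theta, X, y):
            # J(theta) = 1/2 * sum_i (theta^T x^(i) - y^(i))^2
            return 0.5 * np.sum((X @ theta - y) ** 2)

        def gradient_descent_step(theta, X, y, alpha, eps=1e-6):
            """One update theta_j := theta_j - alpha * dJ/dtheta_j, for every j at once,
            with each partial derivative approximated by a central finite difference."""
            grad = np.zeros_like(theta)
            for j in range(len(theta)):
                bump = np.zeros_like(theta)
                bump[j] = eps
                grad[j] = (cost(theta + bump, X, y) - cost(theta - bump, X, y)) / (2 * eps)
            return theta - alpha * grad

        # X already has x0 = 1 prepended; all theta_j are updated simultaneously.
        X = np.array([[1.0, 2.0], [1.0, 3.0]])
        y = np.array([5.0, 7.0])
        theta = gradient_descent_step(np.zeros(2), X, y, alpha=0.01)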

    First, if we have only one training example (x, y) (i.e., m = 1), we have:

            ∂J(θ)/∂θj = ∂/∂θj [ (1/2)(hθ(x) − y)² ]
                      = (hθ(x) − y) · ∂/∂θj (hθ(x) − y)
                      = (hθ(x) − y) · ∂/∂θj (θ0x0 + θ1x1 + … + θnxn − y)
                      = (hθ(x) − y) · xj

    For a single training example, this gives the update rule:

                                        θj := θj + α (y(i) − hθ(x(i))) xj(i)

    The rule is called the LMS update rule (LMS stands for “least mean squares”).
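
    Below is a minimal sketch of this single-example LMS update, assuming x already contains the intercept term x0 = 1; the function name and the toy numbers are mine, not from the notes.

        import numpy as np

        def lms_update(theta, x, y, alpha):
            """theta_j := theta_j + alpha * (y - h_theta(x)) * x_j, applied to every j at once."""
            error = y - theta @ x            # y^(i) - h_theta(x^(i))
            return theta + alpha * error * x

        # One example: x = (x0 = 1, living area in 1000 ft^2, bedrooms), toy numbers.
        theta = np.zeros(3)
        theta = lms_update(theta, np.array([1.0, 2.1, 3.0]), 4.0, alpha=0.01)
        print(theta)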

 

3. Batch gradient descent and stochastic gradient descent

    Batch gradient descent:

                    Repeat until convergence {
                        θj := θj + α Σ (i = 1 to m) (y(i) − hθ(x(i))) xj(i)        (for every j)
                    }

    This method looks at every example in the entire training set on every step: we update the parameters according to the gradient of the error with respect to all the training examples.

    Note: gradient descent can be susceptible to local minima in general, but J is a convex quadratic function, so it has only one global minimum and no other local minima; thus gradient descent always converges to the global minimum (assuming the learning rate α is not too large).
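
    Here is a sketch of batch gradient descent under the same conventions as the earlier snippets (X has x0 = 1 prepended); the learning rate, iteration count, and toy data below are arbitrary illustrative choices, not values from the notes.

        import numpy as np

        def batch_gradient_descent(X, y, alpha, iterations=1000):
            """Each step uses all m training examples:
            theta_j := theta_j + alpha * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)."""
            theta = np.zeros(X.shape[1])
            for _ in range(iterations):
                errors = y - X @ theta                  # length-m vector of residuals
                theta = theta + alpha * (X.T @ errors)  # updates every theta_j simultaneously
            return theta

        # Tiny illustrative run (features should be scaled sensibly for a fixed alpha).
        X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
        y = np.array([2.0, 3.0, 4.0])
        print(batch_gradient_descent(X, y, alpha=0.05))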

   Stochastic gradient descent:

                    Loop {
                        for i = 1 to m {
                            θj := θj + α (y(i) − hθ(x(i))) xj(i)        (for every j)
                        }
                    }

    In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only.
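
    A sketch of stochastic gradient descent: each pass through the training set applies the single-example LMS update once per example. Again, the function name, hyperparameters, and data are illustrative assumptions of mine.

        import numpy as np

        def stochastic_gradient_descent(X, y, alpha, passes=50):
            """Update theta from one training example at a time:
            theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)."""
            theta = np.zeros(X.shape[1])
            for _ in range(passes):
                for i in range(X.shape[0]):
                    error = y[i] - X[i] @ theta
                    theta = theta + alpha * error * X[i]
            return theta

        # Same tiny data as above; updates start after seeing just the first example.
        X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
        y = np.array([2.0, 3.0, 4.0])
        print(stochastic_gradient_descent(X, y, alpha=0.05))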

    Compare: Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ “close” to the minimum much faster than batch gradient descent. (Note, however, that it may never “converge” to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

    Note: 

  a. Both batch gradient descent and stochastic gradient descent update the parameters synchronously: θ(k+1) is computed from θ(k).

  b. While it is more common to run stochastic gradient descent as we have described it, with a fixed learning rate α, it is also possible to ensure that the parameters converge to the global minimum rather than merely oscillate around it, by slowly letting the learning rate decrease to zero as the algorithm runs.
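
    One common way to realize note (b), sketched below, is to shrink the learning rate with the number of updates, e.g. αt = α0 / (1 + t/τ); this particular schedule and its constants are my own illustrative choice, not taken from the notes.

        import numpy as np

        def sgd_decaying_alpha(X, y, alpha0=0.05, tau=100.0, passes=200):
            """Stochastic gradient descent with a learning rate that decays toward zero,
            so theta settles down instead of oscillating around the minimum of J."""
            theta = np.zeros(X.shape[1])
            t = 0
            for _ in range(passes):
                for i in range(X.shape[0]):
                    alpha = alpha0 / (1.0 + t / tau)   # decreasing step size
                    error = y[i] - X[i] @ theta
                    theta = theta + alpha * error * X[i]
                    t += 1
            return theta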

  

   

 
