Classification and logistic regression

1. Guide

  Classification: This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values.

  For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+”.  Given x(i), the corresponding y(i) is also called the label for the training example.

 

2. Logistic regression

  If we use the linear regression algorithm to try to predict y given x for a classification problem, it will perform very poorly. Intuitively, it also doesn’t make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.

  To fix this, we will choose:

    hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

  where

    g(z) = 1 / (1 + e^(−z))

  is called the logistic function or the sigmoid function. Here is a plot showing g(z):

    [Plot of g(z): the sigmoid rises from 0 (as z → −∞) to 1 (as z → +∞), with g(0) = 0.5.]

  Here’s a useful property of the derivative of the sigmoid function, which we write as g′:

    g′(z) = e^(−z) / (1 + e^(−z))² = g(z)(1 − g(z))

  Following how we saw that least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let’s endow our classification model with a set of probabilistic assumptions and then fit the parameters via maximum likelihood.

  Let us assume that

    P(y = 1 | x; θ) = hθ(x)

    P(y = 0 | x; θ) = 1 − hθ(x)

    Note that these two assumptions can be written more compactly as p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y).

  Note: if θᵀx ≥ 0, then hθ(x) ≥ 0.5, so P(y = 1 | x; θ) ≥ P(y = 0 | x; θ); hence we predict y = 1.
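  As a small illustration (the numbers below are toy values of our own choosing, not from the notes), the two class probabilities and the resulting prediction follow directly from the sign of θᵀx:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and one input; x includes the intercept term x_0 = 1.
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 0.8])

z = theta @ x                  # theta^T x = 0.6 >= 0
p_pos = sigmoid(z)             # P(y = 1 | x; theta) ≈ 0.65
p_neg = 1.0 - p_pos            # P(y = 0 | x; theta) ≈ 0.35
print(p_pos, p_neg)
print(1 if z >= 0 else 0)      # predict y = 1 because theta^T x >= 0
```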

  Assuming that the m training examples were generated independently, the likelihood of θ is:

    L(θ) = ∏_{i=1..m} p(y(i) | x(i); θ) = ∏_{i=1..m} (hθ(x(i)))^y(i) (1 − hθ(x(i)))^(1−y(i))

  The log likelihood:

    l(θ) = log L(θ) = Σ_{i=1..m} [ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]

  We will use gradient ascent to maximize the log likelihood (note the positive sign in the update, since we are maximizing rather than minimizing):

    θ := θ + α∇θl(θ).

  As with stochastic gradient descent, let's start by working with just one training example (x, y) and take derivatives to derive the update rule:

    ∂l(θ)/∂θj = ( y · 1/g(θᵀx) − (1 − y) · 1/(1 − g(θᵀx)) ) · ∂g(θᵀx)/∂θj
              = ( y · 1/g(θᵀx) − (1 − y) · 1/(1 − g(θᵀx)) ) · g(θᵀx)(1 − g(θᵀx)) · xj
              = ( y (1 − g(θᵀx)) − (1 − y) g(θᵀx) ) · xj
              = ( y − hθ(x) ) xj

  This therefore gives us the stochastic gradient ascent rule:

    θj := θj + α ( y(i) − hθ(x(i)) ) xj(i)

  If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i).
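  To make the update rule concrete, here is a minimal sketch of stochastic gradient ascent for logistic regression (NumPy; the toy data, learning rate, and epoch count are made up for illustration, not taken from the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(X, y, alpha=0.1, epochs=100):
    """Repeatedly apply theta_j := theta_j + alpha (y(i) - h_theta(x(i))) x_j(i),
    one training example at a time."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = y[i] - sigmoid(X[i] @ theta)  # y(i) - h_theta(x(i))
            theta += alpha * error * X[i]         # update all components of theta at once
    return theta

# Toy 1-D data with an intercept column x_0 = 1; positives have larger x_1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = stochastic_gradient_ascent(X, y)
print(theta)                        # decision boundary near x_1 = 0
print(sigmoid(X @ theta) >= 0.5)    # [False False  True  True]
```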

          

  

  

  

 
