Classification and logistic regression
1. Guide
Classification: This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values.
For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. 0 is also called the negative class, and 1 the positive class, and they are sometimes also denoted by the symbols “-” and “+”. Given x(i), the corresponding y(i) is also called the label for the training example.
2. Logistic regression
If we use the linear regression algorithm to try to predict y given x, it will perform very poorly. Intuitively, it also doesn’t make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.
To fix this, we will choose:

hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx)),

where

g(z) = 1 / (1 + e^(−z))

is called the logistic function or the sigmoid function. g(z) tends toward 1 as z → ∞ and toward 0 as z → −∞, so g(z), and hence hθ(x), is always bounded between 0 and 1.
Here’s a useful property of the derivative of the sigmoid function, which we write as g′:

g′(z) = d/dz [1 / (1 + e^(−z))] = e^(−z) / (1 + e^(−z))² = g(z)(1 − g(z)).
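As a quick illustration (not part of the original derivation), here is a minimal NumPy sketch of g(z) together with a numerical check of this derivative identity; the function names are just illustrative:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Uses the identity g'(z) = g(z) * (1 - g(z))
    g = sigmoid(z)
    return g * (1.0 - g)

# Numerical check of the identity at a few points
z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid_grad(z)))  # True
```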
Following how we saw that least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions and then fit the parameters via maximum likelihood.
Let us assume that
P(y = 1 | x; θ) = hθ(x)
P(y = 0 | x; θ) = 1 − hθ(x)
This can be written more compactly as

p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y).
Note: if θᵀx ≥ 0, then hθ(x) ≥ 0.5, so P(y = 1 | x; θ) ≥ P(y = 0 | x; θ), and hence we predict y = 1.
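To make this decision rule concrete, here is a small sketch, assuming a parameter vector theta and a feature vector x as NumPy arrays (the names are hypothetical):

```python
import numpy as np

def predict_proba(theta, x):
    # P(y = 1 | x; theta) = h_theta(x) = g(theta^T x)
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

def predict(theta, x):
    # theta^T x >= 0  <=>  h_theta(x) >= 0.5, so we choose the positive class
    return 1 if np.dot(theta, x) >= 0 else 0
```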
Assuming that the m training examples were generated independently, the likelihood of θ is:

L(θ) = ∏ᵢ p(y(i) | x(i); θ) = ∏ᵢ (hθ(x(i)))^y(i) (1 − hθ(x(i)))^(1−y(i)),

where the product runs over i = 1, …, m.
As usual, it is easier to work with the log likelihood:

ℓ(θ) = log L(θ) = Σᵢ [ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ].
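A minimal sketch of evaluating this log likelihood, assuming a design matrix X of shape (m, n) and a 0/1 label vector y of length m:

```python
import numpy as np

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y(i) log h(x(i)) + (1 - y(i)) log(1 - h(x(i))) ]
    h = 1.0 / (1.0 + np.exp(-X.dot(theta)))  # h_theta(x(i)) for every example
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```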
We will use gradient ascent to maximize the log likelihood:

θ := θ + α∇θℓ(θ).

(Note the positive rather than negative sign in the update formula, since we are maximizing rather than minimizing here.)
Let's start by working with just one training example (x, y), as we did when deriving the LMS (gradient descent) rule, and take derivatives:

∂ℓ(θ)/∂θⱼ = ( y / g(θᵀx) − (1 − y) / (1 − g(θᵀx)) ) · ∂g(θᵀx)/∂θⱼ
= ( y / g(θᵀx) − (1 − y) / (1 − g(θᵀx)) ) · g(θᵀx)(1 − g(θᵀx)) · xⱼ
= ( y (1 − g(θᵀx)) − (1 − y) g(θᵀx) ) · xⱼ
= ( y − hθ(x) ) xⱼ,

where we used the fact that g′(z) = g(z)(1 − g(z)). This therefore gives us the stochastic gradient ascent rule:

θⱼ := θⱼ + α ( y(i) − hθ(x(i)) ) xⱼ(i).
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θᵀx(i).
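Putting the pieces together, here is a minimal sketch of logistic regression trained with the stochastic gradient ascent rule above; the synthetic data, learning rate, and number of passes are hypothetical choices for illustration, not prescribed by the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sga(X, y, alpha=0.1, n_epochs=100):
    """Fit theta by stochastic gradient ascent on the log likelihood.

    X : (m, n) design matrix (include a column of ones for the intercept)
    y : (m,) vector of 0/1 labels
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):
            h = sigmoid(X[i].dot(theta))        # h_theta(x^(i))
            theta += alpha * (y[i] - h) * X[i]  # theta_j += alpha * (y^(i) - h) * x_j^(i)
    return theta

# Tiny illustrative run on synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X.dot(true_theta))).astype(float)
theta_hat = train_logistic_sga(X, y)
print(theta_hat)
```

With enough data and a small enough learning rate, theta_hat should land reasonably close to true_theta, since maximum likelihood is being (stochastically) maximized here.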