Logistic regression is an excellent tool to know for classification problems, which are problems where you are trying to classify observations into groups. To make our examples more concrete, we will consider the Iris dataset. The Iris dataset contains 4 attributes for 3 types of iris plants. The goal is to identify which plant you have based only on its attributes. To simplify things, we will only consider 2 attributes and 2 classes. Here are the data visually:

In [79]:

from __future__ import division  # must come first; only needed on Python 2

import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets

%matplotlib inline
sns.set(style='ticks', palette='Set2')

data = datasets.load_iris()
X = data.data[:100, :2]
y = data.target[:100]
X_full = data.data[:100, :]

setosa = plt.scatter(X[:50,0], X[:50,1], c='b')
versicolor = plt.scatter(X[50:,0], X[50:,1], c='r')
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend((setosa, versicolor), ("Setosa", "Versicolor"))
sns.despine()

Wow! This is nice - the two classes are completely separate. Now this obviously is a toy example, but let's now think about how to create a learning algorithm to give us the probability that given Sepal Width and Sepal Length the plant is Setosa. So if our algorithm returns .9 we place 90% probability on the plant being Setosa and 10% probability on it being Versicolor.

Logistic Function

So we want to return a value between 0 and 1 to make sure we are actually representing a probability. To do this we will make use of the logistic function. The logistic function mathematically looks like this:

$$y = \frac{1}{1 + e^{-x}}$$

Let's take a look at the plot:

In [35]:

x_values = np.linspace(-5, 5, 100)
y_values = [1 / (1 + math.e**(-x)) for x in x_values]
plt.plot(x_values, y_values)
plt.axhline(.5)
plt.axvline(0)
sns.despine()

You can see why this is a great function for a probability measure. The y-value represents the probability and only ranges between 0 and 1. Also, for an x value of zero you get a .5 probability and as you get more positive x values you get a higher probability and more negative x values a lower probability.
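As a quick numeric check of those properties, here is a minimal sketch in plain Python. The helper name `logistic` is just for illustration and is not used elsewhere in this post:

```python
import math

def logistic(x):
    """Logistic function: maps any real x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Negative inputs land below .5, zero lands exactly on .5, positive inputs above .5
for x in (-5, -1, 0, 1, 5):
    print(x, round(logistic(x), 4))
```

The function is also symmetric around zero, in the sense that `logistic(x) + logistic(-x)` equals 1.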

Make use of your data

Okay - so this is nice, but how the heck do we use it? Well we know we have two attributes - Sepal length and Sepal width - that we need to somehow use in our logistic function. One pretty obvious thing we could do is:

$$x = \beta_0 + \beta_1 SW + \beta_2 SL$$

Where SW is our value for sepal width and SL is our value for sepal length. For those of you familiar with linear regression this looks very familiar. Basically we are assuming that x is a linear combination of our data plus an intercept. For example, say we have a plant with a sepal width of 3.5 and a sepal length of 5 and some oracle tells us that $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 4$. This would imply:

x = 1 + (2 * 3.5) + (4 * 5) = 28

Plugging this into our logistic function gives:

$$\frac{1}{1 + e^{-28}} > 0.99$$

So we would give a better than 99% probability that a plant with those dimensions is Setosa.
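The same arithmetic in Python, as a small sketch. The coefficients 1, 2 and 4 are read off the oracle's equation above; the variable names are mine:

```python
import math

# Coefficients handed to us by the "oracle" in the example above
beta_0, beta_1, beta_2 = 1, 2, 4
sepal_width, sepal_length = 3.5, 5

# Linear combination of the measurements plus the intercept
x = beta_0 + beta_1 * sepal_width + beta_2 * sepal_length

# Squash x through the logistic function to get a probability of Setosa
prob_setosa = 1 / (1 + math.exp(-x))

print(x)            # 28.0
print(prob_setosa)  # very close to 1
```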

Learning

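Okay - makes sense. But who is this oracle giving us our $\beta$ values? This is where the learning in machine learning comes in: we will learn the $\beta$ values from the data.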

Step 1 - Define your cost function

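If you have been around machine learning, you have probably heard the phrase "cost function" thrown around. Before we get to that, though, let's do some thinking. We are trying to choose $\beta$ values that maximize the probability of correctly classifying our plants. For the Setosa plants we want the predicted probability to be close to 1, and for the Versicolor plants we want it close to 0.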

But we don't care about getting the correct probability for just one observation; we want to correctly classify all our observations. If we assume our data are independent and identically distributed, we can just take the product of all our individually calculated probabilities, and that is the value we want to maximize. So in math:

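$$\prod_{\text{Setosa}} \frac{1}{1 + e^{-(\beta_0 + \beta_1 SW + \beta_2 SL)}} \prod_{\text{Versicolor}} \left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 SW + \beta_2 SL)}}\right)$$

If we define the logistic function as:

$$h(x) = \frac{1}{1 + e^{-x}}$$

and x as:

$$x = \beta_0 + \beta_1 SW + \beta_2 SL$$

this can be simplified to:

$$\prod_{\text{Setosa}} h(x) \prod_{\text{Versicolor}} (1 - h(x))$$

The $\prod$ symbol means we take the product over the observations of that class. For the Versicolor observations we use $1 - h(x)$: their probability of being Setosa should be close to 0, so 1 minus that probability should be close to 1. So the value we want to maximize is:

$$\prod_{\text{Setosa}} h(x) \prod_{\text{Versicolor}} (1 - h(x))$$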

So we now have a value we are trying to maximize. Typically people switch this to minimization by making it negative:

$$-\prod_{\text{Setosa}} h(x) \prod_{\text{Versicolor}} (1 - h(x))$$
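To make the product concrete, here is a small sketch in plain Python. The $\beta$ values and the four plants below are made up purely for illustration; they are not fitted to the Iris data, and the helper `h` takes the raw measurements rather than x for convenience:

```python
import math

def h(sw, sl, betas):
    """Probability of Setosa for one plant under the given betas."""
    b0, b1, b2 = betas
    return 1 / (1 + math.exp(-(b0 + b1 * sw + b2 * sl)))

# Hypothetical betas and (sepal width, sepal length) measurements
betas = (0.0, 2.0, -1.0)
setosa = [(3.5, 5.1), (3.0, 4.9)]
versicolor = [(3.2, 7.0), (2.3, 5.5)]

likelihood = 1.0
for sw, sl in setosa:
    likelihood *= h(sw, sl, betas)        # want h(x) close to 1
for sw, sl in versicolor:
    likelihood *= 1 - h(sw, sl, betas)    # want h(x) close to 0

cost = -likelihood  # maximizing the product is the same as minimizing its negative
print(likelihood, cost)
```

Better $\beta$ values push the product toward 1 and the cost toward -1.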

Step 2 - Gradients

So now we have a value to minimize, but how do we actually find the $\beta$ values that minimize it?

This is where convex optimization comes into play. We know that the logistic cost function (in the log form used for the gradients below) is convex - just trust me on this. And since it is convex, any minimum we reach is a global minimum, so gradient descent cannot get stuck in a merely local one.

Here is an image of a convex function:

In [31]:

from IPython.display import Image
Image(url="http://www.me.utexas.edu/~jensen/ORMM/models/unit/nonlinear/subunits/terminology/graphics/convex1.gif")

Out[31]:

Now imagine that this curve is our cost function defined above. If we pick a point on the curve and follow it downhill, we will eventually reach the minimum, which is our goal. Here is an animation of that. That is the idea behind gradient descent.
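A minimal sketch of that idea on a one-dimensional convex curve. The function, starting point, and step size are arbitrary choices for illustration; this is not the logistic cost:

```python
def f(b):
    return (b - 3) ** 2      # a simple convex curve with its minimum at b = 3

def f_prime(b):
    return 2 * (b - 3)       # first derivative (the 1-D "gradient")

b = -4.0                     # arbitrary starting point on the curve
learning_rate = 0.1          # assumed step size for this illustration
for step in range(100):
    b = b - learning_rate * f_prime(b)   # take a small step downhill

print(b, f(b))
```

Each step shrinks the distance to the minimum by a constant factor here, so `b` ends up very close to 3.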

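So the way we follow the curve is by calculating the gradients, or first derivatives, of the cost function with respect to each $\beta$. First, note that we can also define the cost function as:

$$-\sum_{i=1}^{100} \left[ y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i)) \right]$$

This is because when we take the log our product becomes a sum. See log rules. Because the log is an increasing function, the $\beta$ values that are best for the product are also best for its log. And if we define $y_i$ to be 1 when the plant is Setosa and 0 when it is Versicolor, then only the $h(x_i)$ term counts for Setosa and only the $1 - h(x_i)$ term counts for Versicolor. Now let's take the derivative with respect to $\beta_0$, setting the leading negative aside for the moment. Since the derivative of $\log(u)$ is $1/u$, differentiating with respect to $h(x_i)$ gives, for each observation:

$$\frac{y_i}{h(x_i)} - \frac{1 - y_i}{1 - h(x_i)}$$

And using the quotient rule we see that the derivative of h(x) is:

$$\frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \left(1 - \frac{1}{1 + e^{-x}}\right) = h(x)(1 - h(x))$$

And the derivative of x with respect to $\beta_0$ is just 1. Putting it all together with the chain rule we get:

$$\frac{y_i \, h(x_i)(1 - h(x_i))}{h(x_i)} - \frac{(1 - y_i) \, h(x_i)(1 - h(x_i))}{1 - h(x_i)}$$

Simplify to:

$$y_i (1 - h(x_i)) - (1 - y_i) h(x_i) = y_i - y_i h(x_i) - h(x_i) + y_i h(x_i) = y_i - h(x_i)$$

Bring in the negative and sum and we get the partial derivative with respect to $\beta_0$:

$$\sum_{i=1}^{100} (h(x_i) - y_i)$$

Now the other partial derivatives are easy. The only change is that the derivative of x with respect to the coefficient is no longer 1. For $\beta_1$ it is $SW_i$, so the partial derivative is:

$$\sum_{i=1}^{100} (h(x_i) - y_i) SW_i$$

For $\beta_2$ it is $SL_i$:

$$\sum_{i=1}^{100} (h(x_i) - y_i) SL_i$$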
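A common sanity check for hand-derived gradients is to compare them with finite differences of the cost. The sketch below does that in plain Python; the four observations and the $\beta$ values are made up for illustration, and the function names are mine:

```python
import math

def h(x):
    return 1 / (1 + math.exp(-x))

def cost(betas, data):
    """Negative log-likelihood; data holds (SW, SL, y) rows with y = 1 for Setosa."""
    b0, b1, b2 = betas
    total = 0.0
    for sw, sl, y in data:
        p = h(b0 + b1 * sw + b2 * sl)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(betas, data):
    """Analytic partials: sum of (h(x_i) - y_i) times 1, SW_i and SL_i."""
    b0, b1, b2 = betas
    g = [0.0, 0.0, 0.0]
    for sw, sl, y in data:
        err = h(b0 + b1 * sw + b2 * sl) - y
        g[0] += err
        g[1] += err * sw
        g[2] += err * sl
    return g

# Made-up observations and betas, for illustration only
data = [(3.5, 5.1, 1), (3.0, 4.9, 1), (3.2, 7.0, 0), (2.3, 5.5, 0)]
betas = [0.1, 0.5, -0.3]

# Central finite differences should agree with the analytic gradient
eps = 1e-6
numeric = []
for j in range(3):
    up = list(betas)
    down = list(betas)
    up[j] += eps
    down[j] -= eps
    numeric.append((cost(up, data) - cost(down, data)) / (2 * eps))

print(gradient(betas, data))
print(numeric)
```

If the two printed lists disagree beyond a few decimal places, the derivation or the code has a mistake somewhere.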

Step 3 - Gradient Descent

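So now that we have our gradients, we can use the gradient descent algorithm to find the values for our $\beta$s that minimize the cost function. The algorithm is very simple:

Initially guess any values for your $\beta$ values
Repeat until converge:
- $\beta_j = \beta_j - \alpha \cdot (\text{gradient with respect to } \beta_j)$ for $j = 0, 1, 2$

Here $\alpha$ is our learning rate: how large a step to take along the cost curve. We subtract because the gradient points in the direction of greatest increase, and we want to move downhill.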
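The loop can be sketched in plain Python, one $\beta$ at a time, which stays close to the math. This is only an illustration under some assumptions: the data are made up, the learning rate `alpha` is picked by hand, and a fixed number of steps stands in for a real convergence test:

```python
import math

def h(x):
    """Logistic function."""
    return 1 / (1 + math.exp(-x))

def cost(betas, data):
    """Negative log-likelihood over (SW, SL, y) rows, y = 1 for Setosa."""
    b0, b1, b2 = betas
    total = 0.0
    for sw, sl, y in data:
        p = h(b0 + b1 * sw + b2 * sl)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(betas, data):
    """Partial derivatives of the cost with respect to beta_0, beta_1, beta_2."""
    b0, b1, b2 = betas
    g0 = g1 = g2 = 0.0
    for sw, sl, y in data:
        err = h(b0 + b1 * sw + b2 * sl) - y
        g0 += err
        g1 += err * sw
        g2 += err * sl
    return [g0, g1, g2]

def gradient_descent(data, alpha=0.01, n_steps=1000):
    betas = [0.0, 0.0, 0.0]            # initial guess
    for _ in range(n_steps):           # fixed step count instead of a convergence test
        grads = gradient(betas, data)
        # update every beta at once, stepping against its partial derivative
        betas = [b - alpha * g for b, g in zip(betas, grads)]
    return betas

# Made-up (SW, SL, y) observations, for illustration only
data = [(3.5, 5.1, 1), (3.0, 4.9, 1), (3.2, 7.0, 0), (2.3, 5.5, 0)]
start_cost = cost([0.0, 0.0, 0.0], data)
fitted = gradient_descent(data)
print(start_cost, cost(fitted, data))
```

The second printed number should be smaller than the first, since every step moves downhill when the learning rate is small enough.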

Gradient Descent Tricks

I think most of these are from Andrew Ng's machine learning course.

Normalize variables:
- This means for each variable subtract the mean and divide by standard deviation.
Learning rate:
- If gradient descent is not converging, the learning rate needs to be smaller, but a smaller rate will take longer to converge
- Good values to try ..., .001, .003, .01, .03, .1, .3, 1, 3, ...
Declare convergence if the cost decreases by less than some small threshold in one iteration (the code below defaults to .001)
Plot convergence as a check
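The normalization trick is a one-liner with NumPy. Here is a minimal sketch on a made-up array (the four rows are hypothetical measurements, not the actual dataset):

```python
import numpy as np

# Hypothetical measurements: columns are sepal length and sepal width
X_demo = np.array([[5.1, 3.5],
                   [4.9, 3.0],
                   [7.0, 3.2],
                   [5.5, 2.3]])

# Subtract each column's mean and divide by its standard deviation
X_demo_norm = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_norm.mean(axis=0))  # approximately 0 for every column
print(X_demo_norm.std(axis=0))   # approximately 1 for every column
```

Putting every feature on the same scale keeps one large-valued feature from dominating the gradient, which is why a single learning rate then works for all coefficients.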

Let's see some code

Below is code that implements everything we discussed. It is vectorized, though, so things are represented as vectors and matrices. It should still be fairly clear what is going on (I hope...if not, please let me know and I can put out a version closer to the math). Also, I didn't implement an intercept (so no $\beta_0$). Feel free to add one if you wish.
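For reference, the vectorized gradient can be written in one line. The notation here is mine, chosen to match the variable names in the code: $X$ is the matrix with one row per observation, $\theta$ is the vector of coefficients, $y$ is the vector of labels, and $h$ is applied element-wise:

$$\nabla_\theta = X^{T} \left( h(X\theta) - y \right)$$

Component $j$ of this vector is $\sum_i (h(x_i) - y_i) \, X_{ij}$, which is the same form as the partial derivatives for sepal width and sepal length worked out by hand.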

In [37]:

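def logistic_func(theta, x):
    """Logistic function applied to each row of x (h in the math above)."""
    return 1.0 / (1 + np.exp(-x.dot(theta)))


def log_gradient(theta, x, y):
    """Gradient of the summed cost: sum over i of (h(x_i) - y_i) * x_i."""
    first_calc = logistic_func(theta, x) - np.squeeze(y)
    final_calc = first_calc.T.dot(x)
    return final_calc


def cost_func(theta, x, y):
    """Mean negative log-likelihood (the summed cost divided by n)."""
    log_func_v = logistic_func(theta, x)
    y = np.squeeze(y)
    step1 = y * np.log(log_func_v)
    step2 = (1 - y) * np.log(1 - log_func_v)
    final = -step1 - step2
    return np.mean(final)


def grad_desc(theta_values, X, y, lr=.001, converge_change=.001):
    """Run gradient descent until the cost drops by less than converge_change."""
    # normalize each feature to zero mean and unit standard deviation
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # track [iteration, cost] pairs so convergence can be plotted later
    cost_iter = []
    cost = cost_func(theta_values, X, y)
    cost_iter.append([0, cost])
    change_cost = 1
    i = 1
    while change_cost > converge_change:
        old_cost = cost
        # step against the gradient, scaled by the learning rate
        theta_values = theta_values - (lr * log_gradient(theta_values, X, y))
        cost = cost_func(theta_values, X, y)
        cost_iter.append([i, cost])
        change_cost = old_cost - cost
        i += 1
    return theta_values, np.array(cost_iter)


def pred_values(theta, X, hard=True):
    """Return 0/1 labels (hard=True) or class-1 probabilities (hard=False)."""
    # normalize using the mean and standard deviation of the X passed in
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    pred_prob = logistic_func(theta, X)
    pred_value = np.where(pred_prob >= .5, 1, 0)
    if hard:
        return pred_value
    return pred_prob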
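If you do want an intercept, one common approach is to add a column of ones to the data so that the first coefficient plays the role of $\beta_0$. A minimal sketch, where `add_intercept` is a hypothetical helper and not part of the code above:

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

# Two made-up rows, for illustration only
X_demo = np.array([[5.1, 3.5],
                   [7.0, 3.2]])
X_with_intercept = add_intercept(X_demo)
print(X_with_intercept)
```

One caveat: a column of ones has a standard deviation of zero, so it has to be added after normalizing. Since `grad_desc` and `pred_values` normalize internally, they would need to be adjusted before this helper could be used with them.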

Put it to the test

So here I will use the above code for our toy example. I initialize our $\beta$ values to zero and then run gradient descent to learn them.

In [54]:

shape = X.shape[1]
y_flip = np.logical_not(y) #flip Setosa to be 1 and Versicolor to zero to be consistent
betas = np.zeros(shape)
fitted_values, cost_iter = grad_desc(betas, X, y_flip)
print(fitted_values)

[-1.52645347  1.39922382]

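So I get a value of about -1.5 for the first coefficient and 1.4 for the second. Since the columns of X are sepal length and then sepal width, longer sepals push the prediction toward Versicolor and wider sepals push it toward Setosa, which makes sense given our earlier plot.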

Now let's make some predictions (Note: since we are returning a probability, if the probability is greater than or equal to 50% then I assign the value to Setosa - or a value of 1):

In [56]:

predicted_y = pred_values(fitted_values, X)
predicted_y

Out[56]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

And let's see how accurate we are:

In [70]:

np.sum(y_flip == predicted_y)

Out[70]:

99

Cool - we got all but 1 right. So that is pretty good. But again note: this is a very simple example, where getting all correct is actually pretty easy and we are looking at training accuracy. But that is not the point - we just want to make sure our algorithm is working.

We can do another check by taking a look at how our gradient descent converged:

In [99]:

plt.plot(cost_iter[:,0], cost_iter[:,1])
plt.ylabel("Cost")
plt.xlabel("Iteration")
sns.despine()

You can see that as we ran our algorithm, the cost kept decreasing, and we stopped right about where the decrease in cost levels out. Nice - everything seems to be working!
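The same check can be done numerically instead of by eye. Here is a small sketch on a made-up cost history laid out like `cost_iter` (the numbers are hypothetical, not the actual run):

```python
# Hypothetical cost history in the same [iteration, cost] layout as cost_iter
cost_history = [[0, 0.693], [1, 0.600], [2, 0.540], [3, 0.500], [4, 0.4995]]

costs = [cost for _, cost in cost_history]
# decrease in cost between consecutive iterations
drops = [before - after for before, after in zip(costs, costs[1:])]

always_decreasing = all(d > 0 for d in drops)
leveled_out = drops[-1] < 0.001   # same value as the default converge_change

print(drops)
print(always_decreasing, leveled_out)
```

A cost that ever goes up between iterations is usually a sign that the learning rate is too large.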

Lastly, another nice check is to see how well a packaged version of the algorithm does:

In [66]:

from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(X, y_flip)
sum(y_flip == logreg.predict(X))

Out[66]:

99

Cool - they also get 99 / 100 correct. Looking good :)

Advanced Optimization

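So gradient descent is one way to learn our $\beta$ values, but there are other ways too. Here are some more advanced optimization algorithms that are available in SciPy: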

BFGS
- http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_bfgs.html
L-BFGS: Like BFGS but uses limited memory
- http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_l_bfgs_b.html
Conjugate Gradient
- http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fmin_cg.html

Here are the very high level advantages / disadvantages of using one of these algorithms over gradient descent:

Advantages
- Don't need to pick learning rate
- Often run faster (not always the case)
- Can numerically approximate gradient for you (doesn't always work out well)
Disadvantages
- More complex
- More of a black box unless you learn the specifics

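The one I hear most about these days is L-BFGS, so I will use it as my example. To use the others, all you do is replace the scipy function with the one in the links above. The arguments used here (x0, args, fprime) keep the same names, but note that the return values differ: fmin_l_bfgs_b returns a tuple whose first element is the solution, while fmin_bfgs and fmin_cg return the solution array directly by default. Also, I will now use all 4 features as opposed to just 2.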

L-BFGS

In [89]:

from scipy.optimize import fmin_l_bfgs_b
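# normalize data
norm_X = (X_full - np.mean(X_full, axis=0)) / np.std(X_full, axis=0)
myargs = (norm_X, y_flip)
betas = np.zeros(norm_X.shape[1])
# Caveat: cost_func returns the mean cost while log_gradient returns the
# gradient of the summed cost, so fprime is n times the gradient of cost_func.
lbfgs_fitted = fmin_l_bfgs_b(cost_func, x0=betas, args=myargs, fprime=log_gradient)
lbfgs_fitted[0]  # first element of the returned tuple holds the fitted betas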

Out[89]:

array([ -1.39630462,   5.3512917 ,  -9.41860088, -10.84876254])

Above are the $\beta$ values we have learned. Now let's make some predictions:

In [90]:

lbfgs_predicted = pred_values(lbfgs_fitted[0], norm_X, hard=True)
sum(lbfgs_predicted == y_flip)

Out[90]:

100

A perfect 100 - not bad.

Compare with Scikit-Learn

In [94]:

from sklearn import linear_model
logreg = linear_model.LogisticRegression()
logreg.fit(norm_X, y_flip)
sum(y_flip == logreg.predict(norm_X))

Out[94]:

Compare with our implementation

In [98]:

fitted_values, cost_iter = grad_desc(betas, norm_X, y_flip)
predicted_y = pred_values(fitted_values, norm_X)
sum(predicted_y == y_flip)

Out[98]:

So with all 4 features, all three approaches get perfect accuracy, which is to be expected given that the classes are linearly separable. So no surprise here, but it is nice to know things are working :). Note: This example doesn't really let L-BFGS shine. The purpose of this post, though, isn't to evaluate advanced optimization techniques. If that is your interest, try running some tests on much larger data with many more features and less separable classes.

Conclusion

I hope this little tutorial helped you understand logistic regression in some depth. It is a powerful tool that is good to know. It can become even more powerful with things like regularization.

Even more so, I hope this helped explain the steps of how a learning algorithm might be designed. Having a grasp on what a cost function is and how to minimize it with techniques such as gradient descent can really help understand some of the machine learning literature.

Anyway - if you have any questions or comments, I would love to hear them!


Logistic Regression and Gradient Descent