Logistic Regression and Gradient Descent
Logistic regression is an excellent tool to know for classification problems, which are problems where you are trying to sort observations into groups. To make our examples more concrete, we will consider the Iris dataset. The Iris dataset contains 4 attributes for 3 types of iris plants. The goal is to classify which plant you have based only on the attributes. To simplify things, we will only consider 2 attributes (sepal length and sepal width) and 2 classes (Setosa and Versicolor). Here are the data visually:
from __future__ import division  # must come first; only needed on Python 2
import math

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets

%matplotlib inline
sns.set(style='ticks', palette='Set2')
data = datasets.load_iris()
X = data.data[:100, :2]
y = data.target[:100]
X_full = data.data[:100, :]
setosa = plt.scatter(X[:50,0], X[:50,1], c='b')
versicolor = plt.scatter(X[50:,0], X[50:,1], c='r')
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend((setosa, versicolor), ("Setosa", "Versicolor"))
sns.despine()
Wow! This is nice - the two classes are completely separate. This is obviously a toy example, but let's think about how to create a learning algorithm that, given sepal width and sepal length, gives us the probability that the plant is Setosa. So if our algorithm returns .9, we place 90% probability on the plant being Setosa and 10% probability on it being Versicolor.
Logistic Function
So we want to return a value between 0 and 1 to make sure we are actually representing a probability. To do this we will make use of the logistic function. The logistic function mathematically looks like this:
$$h(x) = \frac{1}{1 + e^{-x}}$$
And here it is plotted:
x_values = np.linspace(-5, 5, 100)
y_values = [1 / (1 + math.e**(-x)) for x in x_values]
plt.plot(x_values, y_values)
plt.axhline(.5)
plt.axvline(0)
sns.despine()
You can see why this is a great function for a probability measure. The y-value represents the probability and only ranges between 0 and 1. Also, an x value of zero gives a probability of .5, more positive x values give a higher probability, and more negative x values give a lower probability.
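To make those properties concrete, here is a minimal sketch of the logistic function evaluated at a few points. The function name `logistic` is just an illustrative choice, not something defined elsewhere in this post.

```python
import math

def logistic(x):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Negative inputs give values below .5, zero gives exactly .5,
# and positive inputs give values above .5.
for x in (-5, -1, 0, 1, 5):
    print(x, round(logistic(x), 4))
```

Note the symmetry: the value at x and the value at -x always add up to 1.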
Make use of your data
Okay - so this is nice, but how the heck do we use it? Well, we know we have two attributes - sepal length and sepal width - that we need to somehow use in our logistic function. One pretty obvious thing we could do is:
$$x = \beta_0 + \beta_1 SW + \beta_2 SL$$
Where SW is our value for sepal width and SL is our value for sepal length. For those of you familiar with linear regression, this looks very familiar. Basically we are assuming that x is a linear combination of our data plus an intercept. For example, say we have a plant with a sepal width of 3.5 and a sepal length of 5 and some oracle tells us that β0=1, β1=2, and β2=4. This would imply:
$$x = 1 + (2 \times 3.5) + (4 \times 5) = 28$$
Plugging this into our logistic function gives:
$$\frac{1}{1 + e^{-28}} > 0.99$$
So we would give a more than 99% probability to a plant with those dimensions as being Setosa.
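The same arithmetic can be sketched in a few lines of Python. This assumes β1 multiplies sepal width and β2 multiplies sepal length; the variable names are only illustrative.

```python
import math

# Oracle-provided coefficients from the example above
beta0, beta1, beta2 = 1.0, 2.0, 4.0
sepal_width, sepal_length = 3.5, 5.0

# Linear combination of the data plus an intercept
x = beta0 + beta1 * sepal_width + beta2 * sepal_length

# Push it through the logistic function to get a probability
p_setosa = 1.0 / (1.0 + math.exp(-x))

print(x)         # 28.0
print(p_setosa)  # very close to 1
```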
Learning
Okay - makes sense. But who is this oracle giving us our β values? Good question! This is where the learning in machine learning comes in :). We will learn our β values.
Step 1 - Define your cost function
If you have been around machine learning, you have probably heard the phrase "cost function" thrown around. Before we get to that, though, let's do some thinking. We are trying to choose β values in order to maximize the probability of correctly classifying our plants. That is just the definition of our problem. Let's say someone did give us some β values. How would we determine if they were good values or not? We saw above how to get the probability for one example. Now imagine we did this for all our plant observations - all 100. We would now have 100 probability scores. What we would hope is that for the Setosa plants the probability values are close to 1, and for the Versicolor plants the probability values are close to 0.
But we don't care about getting the correct probability for just one observation; we want to correctly classify all our observations. If we assume our data are independent and identically distributed, we can just take the product of all our individually calculated probabilities, and that is the value we want to maximize. So in math:
$$\prod_{\text{Setosa}} \frac{1}{1 + e^{-(\beta_0 + \beta_1 SW + \beta_2 SL)}} \prod_{\text{Versicolor}} \left(1 - \frac{1}{1 + e^{-(\beta_0 + \beta_1 SW + \beta_2 SL)}}\right)$$
The ∏ symbol means take the product for the observations classified as that plant. Here we are making use of the fact that our data are labeled, so this is called supervised learning. Also, you will notice that for Versicolor observations we are taking 1 minus the logistic function. That is because we are trying to find a value to maximize, and since Versicolor observations should have a probability close to zero, 1 minus the probability should be close to 1. So, writing h(x) for the logistic function, we now know that we want to maximize the following:
$$\prod_{\text{Setosa}} h(x) \prod_{\text{Versicolor}} (1 - h(x))$$
So we now have a value we are trying to maximize. Typically people switch this to minimization by making it negative:
$$-\prod_{\text{Setosa}} h(x) \prod_{\text{Versicolor}} (1 - h(x))$$
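As a rough sketch of the idea, the product can be computed directly for a handful of labeled observations. The four measurements and both sets of β values below are hand-picked for illustration only, and the helper names are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def likelihood(betas, observations):
    """Product of the probabilities assigned to the correct class.

    observations holds (sepal_width, sepal_length, label) tuples,
    where label is 1 for Setosa and 0 for Versicolor.
    """
    b0, b1, b2 = betas
    total = 1.0
    for sw, sl, label in observations:
        p = logistic(b0 + b1 * sw + b2 * sl)
        # Setosa contributes h(x); Versicolor contributes 1 - h(x)
        total *= p if label == 1 else (1.0 - p)
    return total

# Illustrative values only, not taken from the fitted model
toy = [(3.5, 5.1, 1), (3.0, 4.9, 1), (3.2, 7.0, 0), (2.8, 6.5, 0)]

good = likelihood((0.0, 4.0, -2.0), toy)   # betas that point the right way
bad = likelihood((0.0, -4.0, 2.0), toy)    # same betas with flipped signs

# The cost is the negative of the likelihood, so lower is better
print(-good, -bad)
```

The β values that separate the classes in the right direction give a much larger product, and therefore a much lower cost.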
Step 2 - Gradients
So now we have a value to minimize, but how do we actually find the β values that minimize our cost function? Do we just try a bunch? That doesn't seem like a good idea...
This is where convex optimization comes into play. We know that the logistic cost function is convex - just trust me on this. And since it is convex, any minimum we find is the global minimum, so we don't have to worry about getting stuck in a local minimum, and we can converge to it using gradient descent.
Here is an image of a convex function:
from IPython.display import Image
Image(url="http://www.me.utexas.edu/~jensen/ORMM/models/unit/nonlinear/subunits/terminology/graphics/convex1.gif")
Now you can imagine that this curve is our cost function defined above, and that if we just pick a point on the curve and follow it down, we would eventually reach the minimum, which is our goal. Here is an animation of that. That is the idea behind gradient descent.
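So the way we follow the curve is by calculating the gradients, or the first derivatives, of the cost function with respect to each β. So let's do some math. First realize that we can also define the cost function as:
$$-\sum_{i=1}^{100} \left[ y_i \log(h(x_i)) + (1 - y_i) \log(1 - h(x_i)) \right]$$
This is because when we take the log, our product becomes a sum (see log rules), and because the log is an increasing function, the β values that minimize the negative log of the product are the same ones that minimize the negative product. And if we define y_i to be 1 when the observation is Setosa and 0 when Versicolor, then we only do h(x) for Setosa and 1 - h(x) for Versicolor. So let's take the derivative of this new version of our cost function with respect to β0. Remember that our β0 is in our x value. So remember that the derivative of log(x) is 1/x, so by the chain rule we get (for each observation, leaving the leading negative aside for now):
$$y_i \frac{1}{h(x_i)} \frac{\partial h(x_i)}{\partial \beta_0} - (1 - y_i) \frac{1}{1 - h(x_i)} \frac{\partial h(x_i)}{\partial \beta_0}$$
And using the quotient rule we see that the derivative of h(x) with respect to x is:
$$\frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \left(1 - \frac{1}{1 + e^{-x}}\right) = h(x)(1 - h(x))$$
And the derivative of x with respect to β0 is just 1. Putting it all together we get:
$$y_i \frac{1}{h(x_i)} h(x_i)(1 - h(x_i)) - (1 - y_i) \frac{1}{1 - h(x_i)} h(x_i)(1 - h(x_i))$$
Simplify to:
$$y_i (1 - h(x_i)) - (1 - y_i) h(x_i) = y_i - h(x_i)$$
Bring in the negative and sum and we get the partial derivative with respect to β0 to be:
$$\sum_{i=1}^{100} (h(x_i) - y_i)$$
Now the other partial derivatives are easy. The only change is now the derivative for x_i is no longer 1. For β1 it is SW_i and for β2 it is SL_i. So the partial derivative for β1 is:
$$\sum_{i=1}^{100} (h(x_i) - y_i) SW_i$$
For β2:
$$\sum_{i=1}^{100} (h(x_i) - y_i) SL_i$$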
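One way to gain confidence in the partial derivatives described above (the sum of h(x_i) - y_i multiplied by 1, SW_i, or SL_i) is to compare them against a finite-difference approximation of the cost. The sketch below does that on four made-up observations; the function names and the starting β values are assumptions for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def cost(betas, data):
    """Negative log-likelihood, summed over (sw, sl, y) observations."""
    b0, b1, b2 = betas
    total = 0.0
    for sw, sl, y in data:
        h = logistic(b0 + b1 * sw + b2 * sl)
        total -= y * math.log(h) + (1 - y) * math.log(1 - h)
    return total

def gradient(betas, data):
    """Analytic partial derivatives with respect to beta0, beta1, beta2."""
    b0, b1, b2 = betas
    grads = [0.0, 0.0, 0.0]
    for sw, sl, y in data:
        err = logistic(b0 + b1 * sw + b2 * sl) - y
        grads[0] += err        # derivative of x w.r.t. beta0 is 1
        grads[1] += err * sw   # derivative of x w.r.t. beta1 is SW
        grads[2] += err * sl   # derivative of x w.r.t. beta2 is SL
    return grads

def numeric_gradient(betas, data, eps=1e-6):
    """Central finite differences: nudge one beta at a time."""
    grads = []
    for j in range(len(betas)):
        up, down = list(betas), list(betas)
        up[j] += eps
        down[j] -= eps
        grads.append((cost(up, data) - cost(down, data)) / (2 * eps))
    return grads

# Illustrative observations: (sepal_width, sepal_length, label)
toy = [(3.5, 5.1, 1), (3.0, 4.9, 1), (3.2, 7.0, 0), (2.8, 6.5, 0)]
betas = [0.1, 0.5, -0.3]

analytic = gradient(betas, toy)
numeric = numeric_gradient(betas, toy)
print(analytic)
print(numeric)
```

If the math is right, the two printed lists should agree to several decimal places.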
Step 3 - Gradient Descent
So now that we have our gradients, we can use the gradient descent algorithm to find the values for our βs that minimize our cost function. The gradient descent algorithm is very simple:
- Initially guess any values for your β values
- Repeat until convergence:
    - βi = βi - (α * gradient with respect to βi), for i = 0, 1, 2 in our case
Here α is our learning rate. Basically, it controls how large a step we take on our cost curve. What we are doing is taking our current β value and then subtracting some fraction of the gradient. We subtract because the gradient is the direction of greatest increase, but we want the direction of greatest decrease. In other words, we pick a random point on our cost curve, check which direction we need to go to get closer to the minimum by using the negative of the gradient, and then update our β values to move closer to the minimum. "Repeat until convergence" means keep updating our β values until our cost value converges - or stops decreasing - meaning we have reached the minimum. Also, it is important to update all the β values at the same time, meaning that you use the same previous β values to compute all the next β values.
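Here is a minimal, non-vectorized sketch of that loop, written to stay close to the math. It starts from zeros rather than a random guess, and the data, learning rate, tolerance, and iteration cap are all illustrative assumptions rather than recommended settings.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def cost(betas, data):
    """Negative log-likelihood, summed over (sw, sl, y) observations."""
    b0, b1, b2 = betas
    total = 0.0
    for sw, sl, y in data:
        h = logistic(b0 + b1 * sw + b2 * sl)
        total -= y * math.log(h) + (1 - y) * math.log(1 - h)
    return total

def gradient(betas, data):
    """Partial derivatives of the cost with respect to each beta."""
    b0, b1, b2 = betas
    grads = [0.0, 0.0, 0.0]
    for sw, sl, y in data:
        err = logistic(b0 + b1 * sw + b2 * sl) - y
        grads[0] += err
        grads[1] += err * sw
        grads[2] += err * sl
    return grads

def gradient_descent(data, alpha=0.01, tol=1e-6, max_iter=10000):
    betas = [0.0, 0.0, 0.0]            # initial guess
    current = cost(betas, data)
    for _ in range(max_iter):
        previous = current
        # All gradients come from the same old betas (simultaneous update)
        grads = gradient(betas, data)
        betas = [b - alpha * g for b, g in zip(betas, grads)]
        current = cost(betas, data)
        if previous - current < tol:   # cost has stopped decreasing
            break
    return betas, current

# Illustrative observations: (sepal_width, sepal_length, label)
toy = [(3.5, 5.1, 1), (3.0, 4.9, 1), (3.2, 7.0, 0), (2.8, 6.5, 0)]
start_cost = cost([0.0, 0.0, 0.0], toy)
betas_fit, final_cost = gradient_descent(toy)
print(start_cost, final_cost, betas_fit)
```

Building the new list of β values in one expression is what makes the update simultaneous: no β is overwritten before the others have used it.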
Gradient Descent Tricks
I think most of these are from Andrew Ng's machine learning course.
- Normalize variables:
    - This means for each variable subtract the mean and divide by the standard deviation.
- Learning rate:
    - If not converging, the learning rate needs to be smaller - but a smaller rate will take longer to converge
    - Good values to try: ..., .001, .003, .01, .03, .1, .3, 1, 3, ...
- Declare convergence if the cost decreases by less than 10^-3 in one iteration (this is just a decent suggestion)
- Plot convergence as a check
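A small sketch of the first few tricks, using plain Python. The helper names `standardize` and `has_converged` are hypothetical, the sample values are made up, and the standard deviation here is the population version (dividing by n), which is what NumPy's `np.std` computes by default.

```python
import math

def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def has_converged(previous_cost, current_cost, tol=1e-3):
    """True once one iteration lowers the cost by less than tol."""
    return previous_cost - current_cost < tol

# Illustrative sepal lengths
sepal_length = [5.1, 4.9, 7.0, 6.5]
scaled = standardize(sepal_length)
print(scaled)

# Candidate learning rates, each roughly 3x the previous one
learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3]

print(has_converged(1.0, 0.9995))  # True: the cost dropped by only 0.0005
print(has_converged(1.0, 0.9))     # False: the cost is still dropping quickly
```

After standardizing, each variable has mean 0 and standard deviation 1, so a single learning rate behaves similarly across all of them.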
Let's see some code
Below is code that implements everything we discussed. It is vectorized, though, so things are represented as vectors and matrices. It should still be fairly clear what is going on (I hope... if not, please let me know and I can put out a version closer to the math). Also, I didn't implement an intercept (so no β0); feel free to add this if you wish :)
def logistic_func(theta, x):
    # Probability for each row of x: 1 / (1 + e^(-x . theta))
    return 1.0 / (1 + np.exp(-x.dot(theta)))

def log_gradient(theta, x, y):
    # (h(x) - y) times each column of x, summed over observations
    first_calc = logistic_func(theta, x) - np.squeeze(y)
    final_calc = first_calc.T.dot(x)
    return final_calc

def cost_func(theta, x, y):
    # Mean negative log-likelihood over all observations
    log_func_v = logistic_func(theta, x)
    y = np.squeeze(y)
    step1 = y * np.log(log_func_v)
    step2 = (1 - y) * np.log(1 - log_func_v)
    final = -step1 - step2
    return np.mean(final)

def grad_desc(theta_values, X, y, lr=.001, converge_change=.001):
    # normalize
    X = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # setup cost iter
    cost_iter = []
    cost = cost_func(theta_values, X, y)
    cost_iter.append([0, cost])
    change_cost = 1
    i = 1
    while change_cost > converge_change:
        old_cost = cost
        theta_values = theta_values - (lr * log_gradient(theta_values, X, y))
        cost = cost_func(theta_values, X, y)
        cost_iter.append([i,