
Notes : <Hands-on ML with Sklearn & TF> Chapter 4

Go through (understanding, building, training)

 
  1. closed-form solution : directly computes the model parameters that best fit the model to the training set (the Normal Equation)
  2. an iterative optimization approach called Gradient Descent (GD)
  3. Polynomial Regression
  4. Logistic Regression and Softmax Regression
 

Linear Regression

 
$$ \hat{y}={h}_{\theta }\left(x \right)={\theta }^{T}\cdot x \\ MSE(X,{h}_{\theta })=\frac{1}{m}\sum_{i=1}^{m}(\theta ^{T}\cdot x^{\left ( i \right )}-y^{\left ( i \right )})^{2} \\ The\ Normal\ Equation : \hat{\theta}=\left ( X^{T}\cdot X \right )^{-1}\cdot X^{T}\cdot y $$
In [1]:
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
In [2]:
X_b = np.c_[np.ones((100, 1)), X]
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
In [3]:
theta_best
Out[3]:
array([[ 4.08630891],
       [ 2.958397  ]])
In [4]:
X_new = np.array([[0],[2]])
X_new_b = np.c_[np.ones((2, 1)),X_new]
y_predict = X_new_b.dot(theta_best)
y_predict
Out[4]:
array([[  4.08630891],
       [ 10.00310291]])
In [5]:
import matplotlib.pyplot as plt 
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
 
In [6]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
Out[6]:
(array([ 4.08630891]), array([[ 2.958397]]))
In [7]:
lin_reg.predict(X_new)
Out[7]:
array([[  4.08630891],
       [ 10.00310291]])
 

Gradient Descent

 

Gradient Descent measures the local gradient of the error function with regard to the parameter vector $\theta$ and goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum.

 
  1. start $\theta$ with random values (random initialization)
  2. improve it gradually, taking a baby step at a time ($f\left ( x + \Delta x \right ) < f\left (x \right );\Delta x = -\gamma \nabla f\left ( x \right )$)
  3. the Linear Regression model's MSE is a convex function, which means the line segment joining any two points on the curve never lies below the curve (a local minimum is the global minimum)
  4. the cost function can be an elongated bowl if the features have very different scales (ensure all features have a similar scale; see the scaling sketch after this list)
  5. training a model means searching for a combination of model parameters that minimizes a cost function (over the training set)
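
A minimal sketch of the feature-scaling point in item 4, using Scikit-Learn's StandardScaler; the features array below is hypothetical, only meant to show that each column ends up with zero mean and unit variance before running Gradient Descent.

from sklearn.preprocessing import StandardScaler

# hypothetical features with very different scales
features = np.array([[1.0, 1000.0],
                     [2.0, 3000.0],
                     [3.0, 2000.0]])

scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)  # zero mean, unit variance per column
print(features_scaled.mean(axis=0), features_scaled.std(axis=0))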
 

Batch Gradient Descent
$$ \frac{\partial}{\partial \theta_{j}}MSE\left ( \theta \right ) = \frac{2}{m}\sum_{i=1}^{m}\left ( \theta ^{T}\cdot x^{\left ( i \right )}-y^{\left ( i \right )} \right ) x_{j}^{(i)} \\ \nabla_{\theta}MSE\left(\theta \right )=\begin{pmatrix} \frac{\partial}{\partial \theta_{0}}MSE\left ( \theta \right ) \\ \frac{\partial}{\partial \theta_{1}}MSE\left ( \theta \right )\\ \cdots\\ \frac{\partial}{\partial \theta_{n}}MSE\left ( \theta \right )\\ \end{pmatrix} =\frac{2}{m} X^{T} \cdot \left(X \cdot \theta - y \right )\\ \theta^{\left(next\ step \right )}=\theta-\eta \nabla_{\theta}MSE(\theta) $$

In [8]:
eta = 0.1  # learning rate
n_iterations = 1000
m = 100
theta_path_bgd = []

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    theta_path_bgd.append(theta)
    
theta
Out[8]:
array([[ 4.08630891],
       [ 2.958397  ]])
 
  1. can use grid search to find a good learning rate (limiting the number of iterations)
  2. set a very large number of iterations but interrupt the algorithm when the norm of the gradient vector becomes smaller than a tiny number called the tolerance (see the sketch after this list)
  3. Batch Gradient Descent with a fixed learning rate has a convergence rate of $O\left(\frac{1}{iterations} \right)$
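
A minimal sketch of the interrupt-on-tiny-gradient rule from item 2, reusing X_b, y, eta, and m from the cells above; epsilon and max_iterations are hypothetical values chosen only for illustration.

epsilon = 1e-6           # tolerance on the gradient norm
max_iterations = 100000  # "very large" iteration budget
theta = np.random.randn(2, 1)

for iteration in range(max_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:  # gradient vector became tiny: stop
        break
    theta = theta - eta * gradients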
 

Stochastic Gradient Descent

 
  1. Batch Gradient Descent uses the whole training set to compute the gradients at every step
  2. Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based on that single instance
  3. the cost function will bounce up and down, even when very close to the minimum; the final parameter values are good but not optimal
  4. good for escaping local optima, but it never settles at the minimum
In [9]:
n_epochs = 50
t0, t1 = 5, 50
theta_path_sgd = []

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    for i in range(m):   #default m=100
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2*xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch*m + i)
        theta = theta - eta*gradients
        theta_path_sgd.append(theta)
        
theta
Out[9]:
array([[ 4.04653513],
       [ 2.95142107]])
In [10]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
Out[10]:
SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.1,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=50, penalty=None, power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
In [11]:
sgd_reg.intercept_, sgd_reg.coef_
Out[11]:
(array([ 4.15137614]), array([ 3.01811746]))
 

Mini-batch Gradient Descent

 
  1. computes the gradients on small random sets of instances called mini-batches
In [12]:
theta_path_mgd = []
t0, t1 = 10, 1000
n_iterations = 10

def learning_schedule(t):
    return t0 / (t + t1)

for epoch in range(n_iterations):
    for i in range(10):   # 10 mini-batches of 20 instances per epoch
        random_index = np.random.randint(m, size=20)
        xi = X_b[random_index]
        yi = y[random_index]
        gradients = 2*xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch*m + i)
        theta = theta - eta*gradients
        theta_path_mgd.append(theta)
        
theta
Out[12]:
array([[ 4.07424356],
       [ 3.02989018]])
In [13]:
theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)
In [14]:
plt.figure(figsize=(10,6))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
plt.show()
 
 

Polynomial Regression

 
  1. add powers of each feature as new features
  2. train a linear model on this extended set of features
In [15]:
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5*X**2 + X + 2 + np.random.randn(m, 1)  # add Gaussian noise
In [16]:
# 1
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print(X[0])
print(X_poly[0])
 
[ 2.17975725]
[ 2.17975725  4.75134169]
In [17]:
# 2
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
Out[17]:
(array([ 2.24038982]), array([[ 1.03146301,  0.43706198]]))
In [18]:
# plot
X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.legend(loc=2, fontsize=14)
plt.show()
 
 

Learning Curves

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline((
            ("poly_features", polybig_features),
            ("std_scaler", std_scaler),
            ("lin_reg", lin_reg),
        ))
    polynomial_regression.fit(X, y)
    y_newbig = polynomial_regression.predict(X_new)
    plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)

plt.plot(X, y, "b.", linewidth=3)
plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.show()
 
 

how complex your model should be

  1. cross-validation : if the model performs well on the training set but generalizes poorly according to cross-validation, it is overfitting; if it performs poorly on both, it is underfitting
  2. look at the learning curves
In [20]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "g-", linewidth=2, label="val")
    plt.axis([0, 80, 0, 3])
    plt.show()
In [21]:
# degree=1
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
 
In [22]:
# degree=10
from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline((
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("sgd_reg", LinearRegression())
))
plot_learning_curves(polynomial_regression, X, y)
 
 
  1. underfitting : adding more training instances will not help
  2. overfitting : feeding more training data can help, until the validation error reaches the training error
 

generalization error

  1. Bias : due to wrong assumptions; a high-bias model is most likely to underfit the training data
  2. Variance : due to the model's excessive sensitivity to small variations in the training data; a high-variance model is most likely to overfit
  3. Irreducible error : due to the noisiness of the data itself
 

Regularized Linear Models

 
  1. the fewer degrees of freedom a model has, the harder it is for it to overfit the data
  2. for a linear model, regularization is typically achieved by constraining the weights of the model
 

Ridge Regression

  1. $$J(\theta ) = MSE(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_{i}^{2}$$
  2. the regularization term is only added during training; once trained, performance is measured with the original (unregularized) cost function
    1. a good cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective.
  3. $\alpha $ : how much you want to regularize the model
  4. bias term $\theta_{0}$ is not regularized
  5. $$ J(\theta ) = MSE(\theta) + \alpha \frac{1}{2} \left (\left \| w \right \|_{2} \right )^{2} \\ where\ w=\begin{pmatrix} \theta_1 \\ ...\\ \theta_n \end{pmatrix}$$
 
$$ \hat{\theta } = (X^T \cdot X + \alpha A)^{-1} \cdot X^T \cdot y $$
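A minimal NumPy sketch of this closed-form solution, where A is assumed to be the identity matrix with a 0 in the top-left cell so that the bias term $\theta_0$ is not regularized; the data and alpha below are hypothetical.

# hypothetical linear data, only to illustrate the closed-form Ridge solution
m_r = 100
X_r = 2 * np.random.rand(m_r, 1)
y_r = 4 + 3 * X_r + np.random.randn(m_r, 1)
X_rb = np.c_[np.ones((m_r, 1)), X_r]  # add the bias column x0 = 1

alpha = 1.0
A = np.identity(X_rb.shape[1])
A[0, 0] = 0  # do not regularize the bias term theta_0
theta_ridge = np.linalg.inv(X_rb.T.dot(X_rb) + alpha * A).dot(X_rb.T).dot(y_r)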
In [23]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='cholesky')
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
Out[23]:
array([[ 5.1560474]])
In [24]:
sgd_reg = SGDRegressor(penalty='l2')  # SGD with an l2 penalty adds a regularization term to the cost function equal to half the square of the l2 norm of the weight vector
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])
Out[24]:
array([ 4.27456991])
In [25]:
#figure
from sklearn.linear_model import Ridge
import numpy.random as rnd

rnd.seed(42)
m = 20
X = 3 * rnd.rand(m, 1)
y = 1 + 0.5 * X + rnd.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("b-", "g--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline((
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ))
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8,4))
plt.subplot(121)  # 1 row and 2 columns subfigure 1
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100))
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)  # 1 row and 2 columns subfigure 2
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1))

plt.show()
 
 

Lasso Regression

  1. $$J(\theta ) = MSE(\theta ) + \alpha \sum_{i=1}^{n} \left | \theta_i \right |$$
  2. uses the l1 norm of the weight vector instead of half the square of the l2 norm
  3. Lasso Regression automatically performs feature selection and outputs a sparse model
In [26]:
from sklearn.linear_model import Lasso

rnd.seed(42)
m = 20
X = 3 * rnd.rand(m, 1)
y = 1 + 0.5 * X + rnd.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("b-", "g--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline((
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ))
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1))
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1))

plt.show()
 
/usr/local/lib/python3.5/dist-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
 
In [27]:
# handson-ml -- 04 -- 37-38
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

t1a, t1b, t2a, t2b = -1, 3, -1.5, 1.5

# ignoring bias term
t1s = np.linspace(t1a, t1b, 500)
t2s = np.linspace(t2a, t2b, 500)
t1, t2 = np.meshgrid(t1s, t2s)
T = np.c_[t1.ravel(), t2.ravel()]
Xr = np.array([[-1, 1], [-0.3, -1], [1, 0.1]])
yr = 2 * Xr[:, :1] + 0.5 * Xr[:, 1:]

J = (1/len(Xr) * np.sum((T.dot(Xr.T) - yr.T)**2, axis=1)).reshape(t1.shape)

N1 = np.linalg.norm(T, ord=1, axis=1).reshape(t1.shape)
N2 = np.linalg.norm(T, ord=2, axis=1).reshape(t1.shape)

t_min_idx = np.unravel_index(np.argmin(J), J.shape)
t1_min, t2_min = t1[t_min_idx], t2[t_min_idx]

t_init = np.array([[0.25], [-1]])

def bgd_path(theta, X, y, l1, l2, core = 1, eta = 0.1, n_iterations = 50):
    path = [theta]
    for iteration in range(n_iterations):
        gradients = core * 2/len(X) * X.T.dot(X.dot(theta) - y) + l1 * np.sign(theta) + 2 * l2 * theta

        theta = theta - eta * gradients
        path.append(theta)
    return np.array(path)

plt.figure(figsize=(12, 8))
for i, N, l1, l2, title in ((0, N1, 0.5, 0, "Lasso"), (1, N2, 0,  0.1, "Ridge")):
    JR = J + l1 * N1 + l2 * N2**2
    
    tr_min_idx = np.unravel_index(np.argmin(JR), JR.shape)
    t1r_min, t2r_min = t1[tr_min_idx], t2[tr_min_idx]

    levelsJ=(np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(J) - np.min(J)) + np.min(J)
    levelsJR=(np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(JR) - np.min(JR)) + np.min(JR)
    levelsN=np.linspace(0, np.max(N), 10)
    
    path_J = bgd_path(t_init, Xr, yr, l1=0, l2=0)
    path_JR = bgd_path(t_init, Xr, yr, l1, l2)
    path_N = bgd_path(t_init, Xr, yr, np.sign(l1)/3, np.sign(l2), core=0)

    plt.subplot(221 + i * 2)
    plt.grid(True)
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.contourf(t1, t2, J, levels=levelsJ, alpha=0.9)
    plt.contour(t1, t2, N, levels=levelsN)
    plt.plot(path_J[:, 0], path_J[:, 1], "w-o")
    plt.plot(path_N[:, 0], path_N[:, 1], "y-^")
    plt.plot(t1_min, t2_min, "rs")
    plt.title(r"$\ell_{}$ penalty".format(i + 1), fontsize=16)
    plt.axis([t1a, t1b, t2a, t2b])

    plt.subplot(222 + i * 2)
    plt.grid(True)
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.contourf(t1, t2, JR, levels=levelsJR, alpha=0.9)
    plt.plot(path_JR[:, 0], path_JR[:, 1], "w-o")
    plt.plot(t1r_min, t2r_min, "rs")
    plt.title(title, fontsize=16)
    plt.axis([t1a, t1b, t2a, t2b])

for subplot in (221, 223):
    plt.subplot(subplot)
    plt.ylabel(r"$\theta_2$", fontsize=20, rotation=0)

for subplot in (223, 224):
    plt.subplot(subplot)
    plt.xlabel(r"$\theta_1$", fontsize=20)

plt.show()
 
 
  1. $$ g\left ( \theta , J \right )=\nabla_{\theta}MSE\left ( \theta \right )+\alpha\begin{pmatrix} sign\left ( \theta_1 \right )\\ sign\left ( \theta_2 \right )\\ ...\\ sign\left ( \theta_n \right ) \end{pmatrix}\ where \ sign\left ( \theta_i \right )=\left\{\begin{matrix} -1,\ if:\theta_i<0\\ 0,\ if:\theta_i=0\\ +1,\ if:\theta_i>0 \end{matrix}\right. $$
  2. the Lasso cost function isn't differentiable at $\theta_i = 0$, but Gradient Descent still works fine if you use the subgradient vector $g(\theta, J)$ instead whenever $\theta_i = 0$ (see the sketch below)
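
A minimal sketch of Batch Gradient Descent driven by this subgradient vector $g(\theta, J)$ (the bgd_path helper above applies the same update); the data, alpha, eta, and number of steps are hypothetical, and the bias term is left unregularized as with Ridge.

# hypothetical data and hyperparameters for a Lasso subgradient-descent sketch
m_l = 100
X_l = 2 * np.random.rand(m_l, 1)
y_l = 4 + 3 * X_l + np.random.randn(m_l, 1)
X_lb = np.c_[np.ones((m_l, 1)), X_l]

alpha, eta, n_steps = 0.1, 0.1, 1000
theta = np.random.randn(2, 1)
for step in range(n_steps):
    reg = alpha * np.sign(theta)
    reg[0] = 0  # bias term theta_0 is not regularized
    subgradients = 2/m_l * X_lb.T.dot(X_lb.dot(theta) - y_l) + reg
    theta = theta - eta * subgradients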
In [28]:
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
Out[28]:
array([ 1.53788174])
 

Elastic Net

  1. $$J(\theta ) = MSE(\theta) + r\alpha \sum_{i=1}^{n}\left | \theta_i \right | + \frac{1-r}{2} \alpha \sum_{i=1}^{n} \theta_{i}^{2} \ \ \ when \ r=0 \ it\ is\ equivalent\ to\ Ridge,\ and\ when\ r=1\ to\ Lasso$$
  2. Elastic Net is preferred over Lasso, since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated
In [29]:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
Out[29]:
array([ 1.54333232])
 

Early Stopping

  1. stop training as soon as the validation error reaches a minimum : beautiful free lunch
  2. with Stochastic and Mini-batch GD the validation curve is not so smooth : stop only after the validation error has been above the minimum for some time, then roll back to the model parameters at that minimum
In [30]:
from sklearn.base import clone

X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

poly_scaler = Pipeline((
        ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
        ("std_scaler", StandardScaler()),
    ))

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(n_iter=1, warm_start=True, penalty=None, learning_rate="constant", eta0=0.0005)
#warm_start=True:when fit() is called, it just continues training where it left off instead of restarting from scratch

minimum_val_error = float('inf')
best_epoch = None
best_model = None
for epoch in range(100):
    sgd_reg.fit(X_train_poly_scaled, y_train)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val_predict, y_val)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = clone(sgd_reg)
        
print(best_epoch, best_model)
 
98 SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.0005,
       fit_intercept=True, l1_ratio=0.15, learning_rate='constant',
       loss='squared_loss', n_iter=1, penalty=None, power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=True)
 

Logistic Regression

 
  1. some regression algorithms can be used for classification as well and vice versa
 

Estimating Probabilities

  1. $$\hat{p} = h_{\theta }(x) = \sigma (\theta^T \cdot x)\ \ ,\ \ \sigma(t)=\frac{1}{1+e^{(-t)}}\ \ ,\ \ \hat y=\left\{\begin{matrix} 0,\ if\ \hat p < 0.5\\ 1,\ if\ \hat p \geq 0.5 \end{matrix}\right.$$
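
A minimal NumPy sketch of this probability estimate; theta_demo and x_demo are hypothetical values, only to show the sigmoid applied to the score $\theta^T \cdot x$.

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

theta_demo = np.array([-3.0, 2.0])       # hypothetical parameters (bias, weight)
x_demo = np.array([1.0, 1.7])            # one instance, with bias feature x0 = 1
p_hat = sigmoid(theta_demo.dot(x_demo))  # estimated probability of the positive class
y_hat = int(p_hat >= 0.5)                # class prediction: 1 here, since p_hat > 0.5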
 

Train and Cost Function

  1. $$Cost\ function\ for\ a\ single\ instance:c(\theta ) = \left\{\begin{matrix} -log(\hat p)\ \ \ \ if\ y=1; \\ -log(1- \hat p) \ \ if\ y=0. \end{matrix}\right.\\ Cost\ function:J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}\left [ y^{(i)}log(\hat p^{(i)}) +(1-y^{(i)})log(1-\hat p^{(i)}) \right ]\\ Cost\ function\ partial\ derivatives:\frac{\partial}{\partial \theta_j}J(\theta)=\frac{1}{m} \sum_{i=1}^{m} \left (\sigma (\theta^T \cdot x^{(i)}) - y^{(i)} \right)x_{j}^{(i)} $$
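
A minimal sketch of this cost function and its partial derivatives driving a few Batch Gradient Descent steps, reusing the sigmoid helper from the previous sketch; X_demo, y_demo, eta, and n_steps are hypothetical.

# hypothetical 1-feature training set, with a bias column x0 = 1
X_demo = np.c_[np.ones((4, 1)), np.array([[0.5], [1.0], [1.8], [2.4]])]
y_demo = np.array([[0], [0], [1], [1]])

theta = np.zeros((2, 1))
eta, n_steps = 0.1, 1000
m_demo = len(X_demo)

for step in range(n_steps):
    p_hat = sigmoid(X_demo.dot(theta))                   # estimated probabilities
    cost = -np.mean(y_demo * np.log(p_hat) + (1 - y_demo) * np.log(1 - p_hat))  # J(theta)
    gradients = 1/m_demo * X_demo.T.dot(p_hat - y_demo)  # the partial derivatives above
    theta = theta - eta * gradients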
 

Decision Boundaries

In [31]:
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())
Out[31]:
['DESCR', 'feature_names', 'data', 'target_names', 'target']
In [32]:
X = iris["data"][:,3:]
y = (iris["target"] == 2).astype(np.int)
In [33]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)
Out[33]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [34]:
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)

plt.plot(X_new, y_proba[:, 1], 'g-', label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], 'b--', label="Not Iris-Virginica")

plt.show()
 
In [35]:
log_reg.predict([[1.7],[1.5]])
Out[35]:
array([1, 0])
In [36]:
from sklearn.linear_model import LogisticRegression

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.int)

log_reg = LogisticRegression(C=10**10)
log_reg.fit(X, y)

x0, x1 = np.meshgrid(
        np.linspace(2.9, 7, 500).reshape(-1, 1),
        np.linspace(0.8, 2.7, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")

zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)


left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris-Virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
plt.show()
 
 
  1. can be regularized using $l_1$ or $l_2$ penalties
  2. the hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha but its inverse C (the higher C, the less the model is regularized)
 

SoftMax Regression

  1. Logistic Regression can be generalized to support multiple classes directly
  2. the Softmax Regression model first computes a score $s_k(x)$ for each class $k$, then estimates the probability of each class by applying the softmax function to the scores (see the sketch after this list)
  3. $$ Softmax\ score\ :\ s_k(x)=\theta_{k}^{T} \cdot x \\ \Theta = \begin{pmatrix} \theta_1\\ \theta_2\\ ...\\ \theta_K \end{pmatrix} \\ Softmax\ function\ :\ \hat p_k = \sigma (s(x))_k = \frac{e^{s_k(x)}}{\sum_{j=1}^{K} e^{s_j(x)}} $$
  4. $$Softmax\ classifier\ prediction\ :\ \hat y = \underset{k}{\operatorname{argmax}}\ \sigma (s(x))_k=\underset{k}{\operatorname{argmax}}\ s_k(x)=\underset{k}{\operatorname{argmax}}\ (\theta_{k}^{T} \cdot x) $$
  5. it predicts only one class at a time, so it should be used only with mutually exclusive classes
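
A minimal NumPy sketch of the score and softmax computations referenced in item 2; Theta_demo and x_demo are hypothetical parameters for K = 3 classes with two features plus the bias.

def softmax(scores):
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / exps.sum()

Theta_demo = np.array([[ 1.0,  0.5, -0.2],  # one parameter row theta_k per class
                       [-0.5,  1.0,  0.3],
                       [ 0.2, -1.0,  1.5]])
x_demo = np.array([1.0, 5.0, 2.0])          # one instance, with bias feature x0 = 1

scores = Theta_demo.dot(x_demo)             # s_k(x) for each class k
p_hat = softmax(scores)                     # estimated class probabilities (sum to 1)
y_hat = np.argmax(p_hat)                    # Softmax classifier prediction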
 

Cross entropy

  1. $$ J(\Theta ) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_{k}^{(i)} log(\hat p_{k}^{(i)})\ ,\ where\ y_{k}^{(i)}=1\ if\ the\ target\ class\ of\ the\ i^{th}\ instance\ is\ k,\ otherwise\ 0 $$
  2. $$ K=2\ :\ Cross\ entropy=Cost\ function:J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}\left [ y^{(i)}log(\hat p^{(i)}) +(1-y^{(i)})log(1-\hat p^{(i)}) \right ]\\ $$
  3. $$ Cross\ entropy\ between\ two\ distributions\ p\ and\ q\ :\ H(p,q)=-\sum_{x} p(x)\,log\,q(x) $$
  4. $$ Cross\ entropy\ gradient\ vector\ for\ class\ k\ :\ \nabla_{\theta_k}J(\Theta)=\frac{1}{m}\sum_{i=1}^{m} \left (\hat p_{k}^{(i)}-y_{k}^{(i)}\right)x^{(i)} $$
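
A minimal sketch of the cross-entropy cost and its gradient vectors over a batch, building on Theta_demo from the previous sketch; X_batch and the one-hot targets Y_batch are hypothetical.

# hypothetical batch: 4 instances (bias + 2 features), K = 3 classes, one-hot targets
X_batch = np.array([[1.0, 5.1, 1.8],
                    [1.0, 4.6, 1.5],
                    [1.0, 6.0, 2.5],
                    [1.0, 5.5, 2.1]])
Y_batch = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1]])

scores = X_batch.dot(Theta_demo.T)  # s_k(x) for every instance and class
exps = np.exp(scores - scores.max(axis=1, keepdims=True))
P_hat = exps / exps.sum(axis=1, keepdims=True)  # softmax, row by row

m_batch = len(X_batch)
cross_entropy = -np.mean(np.sum(Y_batch * np.log(P_hat), axis=1))
gradients = 1/m_batch * (P_hat - Y_batch).T.dot(X_batch)  # row k = gradient for class k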
In [37]:
X = iris['data'][:, (2, 3)]
y = iris['target']

softmax_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)
softmax_reg.fit(X, y)

softmax_reg.predict([[5,2]])
softmax_reg.predict_proba([[5, 2]])
Out[37]:
array([[  6.33134078e-07,   5.75276067e-02,   9.42471760e-01]])
In [38]:
x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap, linewidth=5)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()
 
