
Notes : <Hands-on ML with Sklearn & TF> Chapter 4

Go through (understanding, building, training)

 
  1. closed-form solution : directly computes the model parameters that best fit the model to the training set
  2. iterative optimization approach called Gradient Descent (GD)
  3. Polynomial Regression
  4. Logistic Regression and Softmax Regression
 

Linear Regression

 
$$ \hat{y}={h}_{\theta }\left(x \right)={\theta }^{T}\cdot x \\ MSE(X,{h}_{\theta })=\frac{1}{m}\sum_{i=1}^{m}(\theta ^{T}\cdot x^{\left ( i \right )}-y^{\left ( i \right )})^{2} \\ The\ Normal\ Equation : \hat{\theta}=\left ( X^{T}\cdot X \right )^{-1}\cdot X^{T}\cdot y $$
In [1]:
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
In [2]:
X_b = np.c_[np.ones((100, 1)), X]
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
In [3]:
theta_best
Out[3]:
array([[ 4.08630891],
       [ 2.958397  ]])
In [4]:
X_new = np.array([[0],[2]])
X_new_b = np.c_[np.ones((2, 1)),X_new]
y_predict = X_new_b.dot(theta_best)
y_predict
Out[4]:
array([[  4.08630891],
       [ 10.00310291]])
In [5]:
import matplotlib.pyplot as plt 
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
 
In [6]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
Out[6]:
(array([ 4.08630891]), array([[ 2.958397]]))
In [7]:
lin_reg.predict(X_new)
Out[7]:
array([[  4.08630891],
       [ 10.00310291]])
 

Gradient Descent

 

It measures the local gradient of the error function with regard to the parameter vector $\theta$ and moves in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum.

 
  1. start $\theta$ with random values (random initialization)
  2. improve it gradually, taking a baby step at a time ($f\left ( x + \Delta x \right ) < f\left (x \right );\Delta x = -\gamma \nabla f\left ( x \right )$)
  3. the Linear Regression model's MSE is a convex function, meaning the line segment connecting any two points on the curve never crosses the curve (a local minimum is the global minimum)
  4. the cost function can be an elongated bowl if the features have very different scales (ensure all features have a similar scale; see the sketch after this list)
  5. training a model means searching for a combination of model parameters that minimizes a cost function (over the training set)
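
A minimal sketch of the scaling advice in item 4 (assuming the X generated above), using sklearn's StandardScaler; scaling keeps the cost-function bowl roughly round so Gradient Descent converges faster:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each feature now has zero mean and unit variance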
 

Batch Gradient Descent
$$ \frac{\partial}{\partial \theta_{j}}MSE\left ( \theta \right ) = \frac{2}{m}\sum_{i=1}^{m}\left ( \theta ^{T}\cdot x^{\left ( i \right )}-y^{\left ( i \right )} \right ) x_{j}^{(i)} \\ \nabla_{\theta}MSE\left(\theta \right )=\begin{pmatrix} \frac{\partial}{\partial \theta_{0}}MSE\left ( \theta \right ) \\ \frac{\partial}{\partial \theta_{1}}MSE\left ( \theta \right )\\ \cdots\\ \frac{\partial}{\partial \theta_{n}}MSE\left ( \theta \right )\\ \end{pmatrix} =\frac{2}{m} X^{T} \cdot \left(X \cdot \theta - y \right )\\ \theta^{\left(next\ step \right )}=\theta-\eta \nabla_{\theta}MSE(\theta) $$

In [8]:
eta = 0.1  # learning rate
n_iterations = 1000
m = 100
theta_path_bgd = []

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    theta_path_bgd.append(theta)
    
theta
Out[8]:
array([[ 4.08630891],
       [ 2.958397  ]])
 
  1. you can use grid search to find a good learning rate (limit the number of iterations so grid search can eliminate models that take too long to converge)
  2. set a very large number of iterations and interrupt the algorithm when the gradient vector becomes tiny, i.e. when its norm drops below a tiny threshold called the tolerance (see the sketch after this list)
  3. Batch Gradient Descent with a fixed learning rate has a convergence rate of $O\left(\frac{1}{iterations} \right)$
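
A minimal sketch of item 2, reusing X_b, y, m and eta from the cells above; the 1e-6 tolerance and the 100000-iteration budget are arbitrary illustrative values:

tolerance = 1e-6                   # illustrative threshold
theta = np.random.randn(2, 1)      # random initialization

for iteration in range(100000):    # "very large" number of iterations
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < tolerance:  # gradient vector has become tiny
        break
    theta = theta - eta * gradients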
 

Stochastic Gradient Descent

 
  1. Batch Gradient Descent uses the whole training set to compute the gradients at every step
  2. Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based on that single instance
  3. the cost function will bounce up and down, even when very close to the minimum, so the final parameter values are good but not optimal
  4. good for escaping from local optima, but it never settles at the minimum
In [9]:
n_epochs = 50
t0, t1 = 5, 50
theta_path_sgd = []

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    for i in range(m):   #default m=100
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2*xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch*m + i)
        theta = theta - eta*gradients
        theta_path_sgd.append(theta)
        
theta
Out[9]:
array([[ 4.04653513],
       [ 2.95142107]])
In [10]:
from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
Out[10]:
SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.1,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=50, penalty=None, power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
In [11]:
sgd_reg.intercept_, sgd_reg.coef_
Out[11]:
(array([ 4.15137614]), array([ 3.01811746]))
 

Mini-batch Gradient Descent

 
  1. computes the gradients on small random sets of instances called mini-batches
In [12]:
theta_path_mgd = []
t0, t1 = 10, 1000
n_iterations = 10

def learning_schedule(t):
    return t0 / (t + t1)

for epoch in range(n_iterations):
    for i in range(10):   # 10 mini-batches of 20 random instances per epoch (m = 100)
        random_index = np.random.randint(m, size=20)
        xi = X_b[random_index]
        yi = y[random_index]
        gradients = 2*xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch*m + i)
        theta = theta - eta*gradients
        theta_path_mgd.append(theta)
        
theta
Out[12]:
array([[ 4.07424356],
       [ 3.02989018]])
In [13]:
theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)
In [14]:
plt.figure(figsize=(10,6))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")
plt.legend(loc="upper left", fontsize=16)
plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$   ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
plt.show()
 
 

Polynomial Regression

 
  1. add powers of each feature as new features
  2. train a linear model on this extended set of features
In [15]:
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5*X**2 + X + 2 + np.random.randn(m, 1)  # add Gaussian noise
In [16]:
# 1
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
print(X[0])
print(X_poly[0])
 
[ 2.17975725]
[ 2.17975725  4.75134169]
In [17]:
# 2
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_
Out[17]:
(array([ 2.24038982]), array([[ 1.03146301,  0.43706198]]))
In [18]:
# plot
X_new=np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.legend(loc=2, fontsize=14)
plt.show()
 
 

Learning Curves

In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline((
            ("poly_features", polybig_features),
            ("std_scaler", std_scaler),
            ("lin_reg", lin_reg),
        ))
    polynomial_regression.fit(X, y)
    y_newbig = polynomial_regression.predict(X_new)
    plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)

plt.plot(X, y, "b.", linewidth=3)
plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
plt.show()
 
 

how complex should your model be

  1. cross-validation : how well the model performs on the training set and how well it generalizes (see the sketch after this list)
  2. look at the learning curves
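
For item 1, a minimal sketch using sklearn's cross_val_score on the current X, y; the "neg_mean_squared_error" scorer and cv=10 are illustrative choices, not the book's code:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # scores are negative MSE, so flip the sign before the sqrt
print(rmse_scores.mean(), rmse_scores.std())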
In [20]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "g-", linewidth=2, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.axis([0, 80, 0, 3])
    plt.show()
In [21]:
# degree=1
lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
 
In [22]:
# degree=10
from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline((
    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
    ("sgd_reg", LinearRegression())
))
plot_learning_curves(polynomial_regression, X, y)
 
 
  1. underfitting : adding more training instances will not help
  2. overfitting : keep feeding more training data until the validation error reaches the training error
 

generalization error

  1. Bias : due to wrong assumptions; most likely to underfit
  2. Variance : due to the model's excessive sensitivity to small variations in the training data; most likely to overfit
  3. Irreducible error : due to the noisiness of the data itself
 

Regularized Linear Models

 
  1. the fewer degrees of freedom a model has, the harder it is for it to overfit the data
  2. for a linear model, regularization is typically achieved by constraining the weights of the model
 

Ridge Regression

  1. $$J(\theta ) = MSE(\theta) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_{i}^{2}$$
  2. the regularization term is only added to the cost function during training; once the model is trained, evaluate its performance with the original (unregularized) cost function
    1. a good training cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective.
  3. $\alpha$ : how much you want to regularize the model
  4. the bias term $\theta_{0}$ is not regularized
  5. $$ J(\theta ) = MSE(\theta) + \alpha \frac{1}{2} \left \| w \right \|_{2}^{2} \\ where\ w=\begin{pmatrix} \theta_1 \\ \cdots\\ \theta_n \end{pmatrix}$$
 
$$ \hat{\theta } = (X^T \cdot X + \alpha A)^{-1} \cdot X^T \cdot y $$
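The closed-form solution above can be checked directly with NumPy. A minimal sketch on the current X, y, where A is the identity matrix with its top-left cell set to 0 so the bias term $\theta_0$ is not regularized (variable names are illustrative):

alpha = 1
X_b_ridge = np.c_[np.ones((len(X), 1)), X]  # add the bias column
A = np.identity(X_b_ridge.shape[1])
A[0, 0] = 0                                 # do not regularize the bias term
theta_ridge = np.linalg.inv(X_b_ridge.T.dot(X_b_ridge) + alpha * A).dot(X_b_ridge.T).dot(y)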
In [23]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver='cholesky')
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
Out[23]:
array([[ 5.1560474]])
In [24]:
sgd_reg = SGDRegressor(penalty='l2')  # penalty='l2' makes SGD add a regularization term to the cost function equal to half the square of the l2 norm of the weight vector
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])
Out[24]:
array([ 4.27456991])
In [25]:
#figure
from sklearn.linear_model import Ridge
import numpy.random as rnd

rnd.seed(42)
m = 20
X = 3 * rnd.rand(m, 1)
y = 1 + 0.5 * X + rnd.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("b-", "g--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline((
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ))
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8,4))
plt.subplot(121)  # 1 row and 2 columns subfigure 1
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100))
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)  # 1 row and 2 columns subfigure 2
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1))

plt.show()
 
 

Lasso Regression

  1. $$J(\theta ) = MSE(\theta ) + \alpha \sum_{i=1}^{n} \left | \theta_i \right |$$
  2. uses the $\ell_1$ norm of the weight vector
  3. Lasso Regression automatically performs feature selection and outputs a sparse model
In [26]:
from sklearn.linear_model import Lasso

rnd.seed(42)
m = 20
X = 3 * rnd.rand(m, 1)
y = 1 + 0.5 * X + rnd.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

def plot_model(model_class, polynomial, alphas, **model_kargs):
    for alpha, style in zip(alphas, ("b-", "g--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline((
                    ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                    ("std_scaler", StandardScaler()),
                    ("regul_reg", model),
                ))
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
    plt.plot(X, y, "b.", linewidth=3)
    plt.legend(loc="upper left", fontsize=15)
    plt.xlabel("$x_1$", fontsize=18)
    plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1))
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1))

plt.show()
 
/usr/local/lib/python3.5/dist-packages/sklearn/linear_model/coordinate_descent.py:484: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Fitting data with very small alpha may cause precision problems.
  ConvergenceWarning)
 
In [27]:
# handson-ml -- 04 -- 37-38
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

t1a, t1b, t2a, t2b = -1, 3, -1.5, 1.5

# ignoring bias term
t1s = np.linspace(t1a, t1b, 500)
t2s = np.linspace(t2a, t2b, 500)
t1, t2 = np.meshgrid(t1s, t2s)
T = np.c_[t1.ravel(), t2.ravel()]
Xr = np.array([[-1, 1], [-0.3, -1], [1, 0.1]])
yr = 2 * Xr[:, :1] + 0.5 * Xr[:, 1:]

J = (1/len(Xr) * np.sum((T.dot(Xr.T) - yr.T)**2, axis=1)).reshape(t1.shape)

N1 = np.linalg.norm(T, ord=1, axis=1).reshape(t1.shape)
N2 = np.linalg.norm(T, ord=2, axis=1).reshape(t1.shape)

t_min_idx = np.unravel_index(np.argmin(J), J.shape)
t1_min, t2_min = t1[t_min_idx], t2[t_min_idx]

t_init = np.array([[0.25], [-1]])

def bgd_path(theta, X, y, l1, l2, core = 1, eta = 0.1, n_iterations = 50):
    path = [theta]
    for iteration in range(n_iterations):
        gradients = core * 2/len(X) * X.T.dot(X.dot(theta) - y) + l1 * np.sign(theta) + 2 * l2 * theta

        theta = theta - eta * gradients
        path.append(theta)
    return np.array(path)

plt.figure(figsize=(12, 8))
for i, N, l1, l2, title in ((0, N1, 0.5, 0, "Lasso"), (1, N2, 0,  0.1, "Ridge")):
    JR = J + l1 * N1 + l2 * N2**2
    
    tr_min_idx = np.unravel_index(np.argmin(JR), JR.shape)
    t1r_min, t2r_min = t1[tr_min_idx], t2[tr_min_idx]

    levelsJ=(np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(J) - np.min(J)) + np.min(J)
    levelsJR=(np.exp(np.linspace(0, 1, 20)) - 1) * (np.max(JR) - np.min(JR)) + np.min(JR)
    levelsN=np.linspace(0, np.max(N), 10)
    
    path_J = bgd_path(t_init, Xr, yr, l1=0, l2=0)
    path_JR = bgd_path(t_init, Xr, yr, l1, l2)
    path_N = bgd_path(t_init, Xr, yr, np.sign(l1)/3, np.sign(l2), core=0)

    plt.subplot(221 + i * 2)
    plt.grid(True)
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.contourf(t1, t2, J, levels=levelsJ, alpha=0.9)
    plt.contour(t1, t2, N, levels=levelsN)
    plt.plot(path_J[:, 0], path_J[:, 1], "w-o")
    plt.plot(path_N[:, 0], path_N[:, 1], "y-^")
    plt.plot(t1_min, t2_min, "rs")
    plt.title(r"$\ell_{}$ penalty".format(i + 1), fontsize=16)
    plt.axis([t1a, t1b, t2a, t2b])

    plt.subplot(222 + i * 2)
    plt.grid(True)
    plt.axhline(y=0, color='k')
    plt.axvline(x=0, color='k')
    plt.contourf(t1, t2, JR, levels=levelsJR, alpha=0.9)
    plt.plot(path_JR[:, 0], path_JR[:, 1], "w-o")
    plt.plot(t1r_min, t2r_min, "rs")
    plt.title(title, fontsize=16)
    plt.axis([t1a, t1b, t2a, t2b])

for subplot in (221, 223):
    plt.subplot(subplot)
    plt.ylabel(r"$\theta_2$", fontsize=20, rotation=0)

for subplot in (223, 224):
    plt.subplot(subplot)
    plt.xlabel(r"$\theta_1$", fontsize=20)

plt.show()
 
 
  1. $$ g\left ( \theta , J \right )=\nabla_{\theta}MSE\left ( \theta \right )+\alpha\begin{pmatrix} sign\left ( \theta_1 \right )\\ sign\left ( \theta_2 \right )\\ ...\\ sign\left ( \theta_n \right ) \end{pmatrix}\ where \ sign\left ( \theta_i \right )=\left\{\begin{matrix} -1,\ if:\theta_i<0\\ 0,\ if:\theta_i=0\\ +1,\ if:\theta_i>0 \end{matrix}\right. $$
  2. Lasso isn't differentiable at $\theta_i = 0$, but Gradient Descent still works fine if you use the subgradient vector $g(\theta, J)$ above instead whenever $\theta_i = 0$
In [28]:
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])
Out[28]:
array([ 1.53788174])
 

Elastic Net

  1. $$J(\theta ) = MSE(\theta) + r\alpha \sum_{i=1}^{n}\left | \theta_i \right | + \frac{1-r}{2} \alpha \sum_{i=1}^{n} \theta_{i}^{2} \ \ \ (r=0\ gives\ Ridge,\ r=1\ gives\ Lasso)$$
  2. Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated
In [29]:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
Out[29]:
array([ 1.54333232])
 

Early Stopping

  1. stop training as soon as the validation error reaches a minimum : a "beautiful free lunch"
  2. with Stochastic or Mini-batch GD the error curves are not so smooth : stop only after the validation error has been above the minimum for some time
In [30]:
from sklearn.base import clone

X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

poly_scaler = Pipeline((
        ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
        ("std_scaler", StandardScaler()),
    ))

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(n_iter=1, warm_start=True, penalty=None, learning_rate="constant", eta0=0.0005)
#warm_start=True:when fit() is called, it just continues training where it left off instead of restarting from scratch

minimum_val_error = float('inf')
best_epoch = None
best_model = None
for epoch in range(100):
    sgd_reg.fit(X_train_poly_scaled, y_train)
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val_predict, y_val)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = clone(sgd_reg)
        
print(best_epoch, best_model)
 
98 SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.0005,
       fit_intercept=True, l1_ratio=0.15, learning_rate='constant',
       loss='squared_loss', n_iter=1, penalty=None, power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=True)
 

Logistic Regression

 
  1. some regression algorithms can be used for classification as well and vice versa
 

Estimating Probabilities

  1. $$\hat{p} = h_{\theta }(x) = \sigma (\theta^T \cdot x)\ \ ,\ \ \sigma(t)=\frac{1}{1+e^{-t}}\ \ ,\ \ \hat y=\left\{\begin{matrix} 0,\ if\ \hat p < 0.5\\ 1,\ if\ \hat p \geq 0.5 \end{matrix}\right.$$ (a NumPy sketch of this rule follows below)
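
A minimal NumPy sketch of this estimation rule (the function names are illustrative, not the book's code):

def sigmoid(t):
    return 1 / (1 + np.exp(-t))        # logistic function sigma(t)

def predict_class(theta, x):           # x includes the bias feature x0 = 1
    p_hat = sigmoid(theta.T.dot(x))    # estimated probability sigma(theta^T . x)
    return (p_hat >= 0.5).astype(int)  # class 1 if p_hat >= 0.5, else class 0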
 

Training and Cost Function

  1. $$Cost\ function\ for\ a\ single\ instance\ :\ c(\theta ) = \left\{\begin{matrix} -log(\hat p)\ \ \ \ if\ y=1; \\ -log(1- \hat p) \ \ if\ y=0. \end{matrix}\right.\\ Cost\ function\ :\ J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}\left [ y^{(i)}log(\hat p^{(i)}) +(1-y^{(i)})log(1-\hat p^{(i)}) \right ]\\ Partial\ derivatives\ :\ \frac{\partial}{\partial \theta_j}J(\theta )=\frac{1}{m} \sum_{i=1}^{m} \left (\sigma (\theta^T \cdot x^{(i)}) - y^{(i)} \right)x_{j}^{(i)} $$ (a batch Gradient Descent sketch using this gradient follows below)
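
Since the partial derivatives above have the same form as the Linear Regression gradient, the batch Gradient Descent loop from earlier carries over almost unchanged. A minimal, illustrative sketch (names are not the book's code) on the iris "is it Virginica?" task that also appears in the cells below:

from sklearn import datasets

iris_lr = datasets.load_iris()
Xl = iris_lr["data"][:, 3:]                           # petal width
yl = (iris_lr["target"] == 2).astype(np.float64).reshape(-1, 1)
Xl_b = np.c_[np.ones((len(Xl), 1)), Xl]               # add the bias column

eta, n_iterations = 0.1, 5000
theta = np.random.randn(2, 1)
for iteration in range(n_iterations):
    p_hat = 1 / (1 + np.exp(-Xl_b.dot(theta)))        # sigma(theta^T . x) for every instance
    gradients = 1/len(Xl_b) * Xl_b.T.dot(p_hat - yl)  # the partial derivatives above, in matrix form
    theta = theta - eta * gradients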
 

Decision Boundaries

In [31]:
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())
Out[31]:
['DESCR', 'feature_names', 'data', 'target_names', 'target']
In [32]:
X = iris["data"][:,3:]
y = (iris["target"] == 2).astype(np.int)
In [33]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)
Out[33]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [34]:
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)

plt.plot(X_new, y_proba[:, 1], 'g-', label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], 'b--', label="Not Iris-Virginica")
plt.legend(loc="center left", fontsize=14)

plt.show()
 
In [35]:
log_reg.predict([[1.7],[1.5]])
Out[35]:
array([1, 0])
In [36]:
from sklearn.linear_model import LogisticRegression

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.int)

log_reg = LogisticRegression(C=10**10)
log_reg.fit(X, y)

x0, x1 = np.meshgrid(
        np.linspace(2.9, 7, 500).reshape(-1, 1),
        np.linspace(0.8, 2.7, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")

zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)


left_right = np.array([2.9, 7])
boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=1, fontsize=12)
plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris-Virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris-Virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
plt.show()
 
 
  1. can be regularized using $\ell_1$ or $\ell_2$ penalties
  2. the hyperparameter controlling the regularization strength is not alpha but its inverse C (the higher C, the less the model is regularized)
 

SoftMax Regression

  1. Logistic Regression can be generalized to support multiple classes directly
  2. the Softmax model first computes a score $s_k(x)$ for each class k, then estimates the probability of each class by applying the softmax function to the scores (see the sketch after this list)
  3. $$ Softmax\ score\ for\ class\ k\ :\ s_k(x)=\theta_{k}^{T} \cdot x \\ \Theta = \begin{pmatrix} \theta_1\\ \theta_2\\ ...\\ \theta_K \end{pmatrix} \\ Softmax\ function\ :\ \hat p_k = \sigma (s(x))_k = \frac{e^{s_k(x)}}{\sum_{j=1}^{K} e^{s_j(x)}} $$
  4. $$Softmax\ classifier\ prediction\ :\ \hat y = \underset{k}{\operatorname{argmax}}\ \sigma (s(x))_k=\underset{k}{\operatorname{argmax}}\ s_k(x)=\underset{k}{\operatorname{argmax}}\ (\theta_{k}^{T} \cdot x) $$
  5. it predicts only one class at a time (multiclass, not multioutput)
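
A minimal NumPy sketch of the softmax function in item 3 (illustrative; subtracting the row-wise maximum before exponentiating is a standard trick to avoid numerical overflow):

def softmax(scores):
    # scores: an (m, K) matrix holding s_k(x) for every instance and class
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized exponentials
    return exps / exps.sum(axis=1, keepdims=True)              # each row sums to 1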
 

Cross entropy

  1. $$ J(\Theta ) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_{k}^{(i)} log(\hat p_{k}^{(i)})\ ,\ where\ y_{k}^{(i)}=1\ if\ the\ target\ class\ of\ the\ i^{th}\ instance\ is\ k,\ otherwise\ 0 $$
  2. $$ K=2\ :\ Cross\ entropy=Cost\ function:J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}\left [ y^{(i)}log(\hat p^{(i)}) +(1-y^{(i)})log(1-\hat p^{(i)}) \right ]\\ $$
  3. cross entropy between two distributions p and q : $$ H(p,q)=-\sum_{x} p(x)\ log\ q(x) $$
  4. $$ Cross\ entropy\ gradient\ vector\ for\ class\ k\ :\ \nabla_{\theta_k}J(\Theta)=\frac{1}{m}\sum_{i=1}^{m} \left (\hat p_{k}^{(i)}-y_{k}^{(i)}\right)x^{(i)} $$ (a sketch of one Gradient Descent step using this gradient follows below)
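
Putting items 1 and 4 together, one batch Gradient Descent step on the cross entropy takes only a few lines. A minimal, illustrative sketch (names are not the book's) reusing the softmax() helper above and the iris data loaded earlier:

K = 3
Xs = iris["data"][:, (2, 3)]                            # petal length, petal width
Xs_b = np.c_[np.ones((len(Xs), 1)), Xs]                 # add the bias column
Y_onehot = np.eye(K)[iris["target"]]                    # one-hot targets, shape (m, K)
Theta = np.random.randn(Xs_b.shape[1], K)               # one parameter column per class
eta = 0.1

p_hat = softmax(Xs_b.dot(Theta))                        # estimated probabilities, shape (m, K)
gradients = 1/len(Xs_b) * Xs_b.T.dot(p_hat - Y_onehot)  # item 4's gradient, one column per class
Theta = Theta - eta * gradients                         # a single Gradient Descent step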
In [37]:
X = iris['data'][:, (2, 3)]
y = iris['target']

softmax_reg = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)
softmax_reg.fit(X, y)

softmax_reg.predict([[5,2]])
softmax_reg.predict_proba([[5, 2]])
Out[37]:
array([[  6.33134078e-07,   5.75276067e-02,   9.42471760e-01]])
In [38]:
x0, x1 = np.meshgrid(
        np.linspace(0, 8, 500).reshape(-1, 1),
        np.linspace(0, 3.5, 200).reshape(-1, 1),
    )
X_new = np.c_[x0.ravel(), x1.ravel()]


y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)
zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris-Virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris-Versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris-Setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap, linewidth=5)
contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()
 