[Machine learning] Logistic regression
1. Variable definitions
\(m\): number of training examples
\(n\): number of features
\(X\): design matrix. Each row of \(X\) is a training example; each column is a feature (with a leading column of ones for the intercept term)
\[X =
\begin{pmatrix}
1 & x^{(1)}_1 & ... & x^{(1)}_n \\
1 & x^{(2)}_1 & ... & x^{(2)}_n \\
... & ... & ... & ... \\
1 & x^{(m)}_1 & ... & x^{(m)}_n \\
\end{pmatrix}\]
\[\theta =
\begin{pmatrix}
\theta_0 \\
\theta_1 \\
... \\
\theta_n \\
\end{pmatrix}\]
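In Octave the design matrix can be built by prepending a column of ones to the raw features. A minimal sketch, assuming the raw feature values are stored in a matrix named data with one training example per row:
[m, n] = size(data);           % data: raw features, assumed variable name
X = [ones(m, 1), data];        % prepend the intercept column x_0 = 1
theta = zeros(n + 1, 1);       % initial parameter vector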
2. Hypothesis
\[x=
\begin{pmatrix}
x_0 \\
x_1 \\
... \\
x_n \\
\end{pmatrix}
\]
\[h_\theta(x) = g(\theta^T x) = g(\theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n) = \frac{1}{1 + e^{-\theta^T x}},
\]
where \(g\) is the sigmoid function
\[g(z) = \frac{1}{1 + e^{-z}},
\]
g = 1 ./ (1 + exp(-z));        % element-wise, so z may be a scalar, vector, or matrix
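Assuming the line above is the body of a sigmoid(z) function (as used in the cost below), it evaluates element-wise, for example approximately:
sigmoid([-5 0 5])              % ans = 0.0067   0.5000   0.9933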
3. Cost function
\[J(\theta) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1 - h_\theta(x^{(i)}))],
\]
Vectorized Octave implementation:
h = sigmoid(X * theta);                                  % m x 1 vector of predictions
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h));
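Sanity check: with \(\theta = 0\) every prediction equals 0.5, so the cost is \(\log 2 \approx 0.693\) regardless of the labels:
theta = zeros(size(X, 2), 1);
h = sigmoid(X * theta);                                  % every entry is 0.5
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h))     % J = log(2), about 0.6931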
4. Goal
Find \(\theta\) that minimizes \(J(\theta)\); here \(\theta\) is a vector.
4.1 Gradient descent
\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j,
\]
repeat until convergence {
\(\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}_j\)
} (update all \(\theta_j\) simultaneously)
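One step of this update, written as an explicit Octave loop over the features (a sketch only; alpha is an assumed learning-rate variable, and the vectorized form derived next is what you would normally use):
h = sigmoid(X * theta);                    % m x 1 predictions
theta_new = theta;
for j = 1:length(theta)                    % Octave indexes from 1, so j = 1 holds theta_0
  theta_new(j) = theta(j) - (alpha / m) * sum((h - y) .* X(:, j));
end
theta = theta_new;                         % simultaneous update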
Vectorization
\[S =
\begin{pmatrix}
h_\theta(x^{(1)})-y^{(1)} & h_\theta(x^{(2)})-y^{(2)} & ... & h_\theta(x^{(m)})-y^{(m)}
\end{pmatrix}
\begin{pmatrix}
x^{(1)}_0 & x^{(1)}_1 & ... & x^{(1)}_n \\
x^{(2)}_0 & x^{(2)}_1 & ... & x^{(2)}_n \\
... & ... & ... & ... \\
x^{(m)}_0 & x^{(m)}_1 & ... & x^{(m)}_n \\
\end{pmatrix}
\]
\[=
\begin{pmatrix}
\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0 &
\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_1 &
... &
\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_n
\end{pmatrix}
\]
\[\theta := \theta - \frac{\alpha}{m} S^T
\]
\[h_\theta(X) = g(X\theta) = \frac{1}{1 + e^{-X\theta}}
\]
\(X\theta\) is \(m \times 1\), \(y\) is \(m \times 1\)
\(\frac{1}{1+e^{-X\theta}} - y\) is \(m \times 1\)
\[\frac{1}{1 + e^{-X\theta}} - y=
\begin{pmatrix}
h_\theta(x^{(1)})-y^{(1)} \\
h_\theta(x^{(2)})-y^{(2)} \\
... \\
h_\theta(x^{(m)})-y^{(m)}
\end{pmatrix}
\]
\[\theta := \theta - \frac{\alpha}{m} X^T \left(\frac{1}{1 + e^{-X\theta}} - y\right)
\]
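Putting this together, a possible Octave sketch of the vectorized update loop (alpha and num_iters are assumed hyperparameter variables, not part of the derivation above):
for iter = 1:num_iters
  h = sigmoid(X * theta);                       % m x 1 predictions
  theta = theta - (alpha / m) * X' * (h - y);   % simultaneous update of all theta_j
end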
5. Regularized logistic regression
Adding a regularization term helps to avoid overfitting.
Cost function
\[J(\theta) = \frac{1}{m}\sum_{i=1}^m[-y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log(1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^n \theta^2_j,
\]
note that \(\theta_0\) is not regularized (the penalty sum starts at \(j = 1\))
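A vectorized Octave sketch of this cost; theta(1) holds \(\theta_0\) and is excluded from the penalty, and lambda is an assumed variable name:
h = sigmoid(X * theta);
reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);        % skip theta_0
J = -(1 / m) * (y' * log(h) + (1 - y)' * log(1 - h)) + reg;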
Gradient descent
\[\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_0,
\]
\[\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}_j + \frac{\lambda}{m}\theta_j, \quad (j \ge 1)
\]
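A matching Octave sketch of the regularized gradient, again leaving theta(1) (that is, \(\theta_0\)) unpenalized:
h = sigmoid(X * theta);
grad = (1 / m) * X' * (h - y);                            % unregularized part
grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);  % add penalty for j >= 1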