[Notes & Summary] Machine Learning Review

An Introduction to Machine Learning

  1. What is Machine Learning?

    Machine Learning might be better described as ‘model learning’: for the given data, we choose a suitable mathematical model to fit the data and try to explain what the data means, i.e. we fit the relationship between inputs and outputs.

  2. What are the differences between supervised learning and unsupervised learning?

    In supervised learning, the model is given example inputs with their corresponding outputs and must predict the outputs for new inputs. In unsupervised learning, the model is given only inputs and must automatically discover knowledge such as labels, features, or structure.

  3. What are the differences between classification and regression?

    Classification usually handles discrete data: it predicts discrete labels for new inputs. Regression usually deals with continuous data: it predicts continuous values for new inputs.

  4. What are the differences between generative model and discriminative model?

    Generative models and discriminative models both belong to supervised learning. Generative models model the joint probability of label and observation \(p(x,y)\) and then use the Bayes rule \(p(y|x)=\frac{p(x,y)}{p(x)}\) for prediction; classic examples include Naive Bayes and Hidden Markov Models. Discriminative models model the posterior probability of the label given the observation \(p(y|x)\), or directly model the decision function \(y=f(x)\); classic examples include logistic regression and the perceptron.

  5. What is the general process of machine learning?

    First, we propose a hypothesis for our model, i.e. a mathematical model with parameters or structure. Second, we choose a learning criterion to estimate the parameters, such as Maximum Likelihood Estimation or cost-function optimization. Third, we make decisions (predictions) according to the learned model.

    📌: Common learning criteria include the Perceptron Criterion, Least Mean Squares (LMS), Minimum Cross-Entropy (CE), the Maximum Margin Criterion, Maximum Likelihood, etc. Common optimization methods for the cost function include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Newton's Method, etc.

Linear Regression

  1. How to describe a linear regression model?

    Firstly, the hypothesis is

    \[h_{\theta}(x)=\sum_{i=1}^n \theta_ix_i=\theta^T x \]

    the parameter is the vector \(\theta\). Secondly, the learning criterion is to minimize a cost function, defined as the (half) sum of squared errors:

    \[J(\theta)=\frac{1}{2}\sum_{i=1}^m (\theta^Tx^{(i)}-y^{(i)})^2 \]

    To minimize \(J(\theta)\), we can try to get the analytical solution: write \(J(\theta)=\frac{1}{2}(X\theta-y)^T(X\theta-y)\), compute its gradient using matrix derivatives, and set the gradient to zero to obtain the closed-form solution \(\theta^{\star}=(X^T X)^{-1}X^Ty\). However, this calculation is often expensive, so we can instead obtain a numerical solution by iterating the model parameters with an optimization algorithm such as gradient descent (GD):

    \[\theta:=\theta-\alpha \frac{\partial}{\partial \theta}J(\theta)=\theta-\alpha \sum_{i=1}^m(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} \]
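
    A minimal NumPy sketch of both routes (the closed-form normal equation and the gradient-descent iteration above); the toy data `X`, `y`, the learning rate, and the iteration count are illustrative assumptions:

    ```python
    import numpy as np

    # Toy data (assumed): m samples, n features; the first column is a bias of ones.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([2.0, 3.0, 4.0])

    # Analytical solution: theta* = (X^T X)^{-1} X^T y
    theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

    # Numerical solution: batch gradient descent on J(theta)
    theta = np.zeros(X.shape[1])
    alpha = 0.05                                   # learning rate (assumed)
    for _ in range(2000):
        grad = X.T @ (X @ theta - y)               # sum_i (h_theta(x_i) - y_i) x_i
        theta -= alpha * grad

    print(theta_closed, theta)                     # both close to [1., 1.]
    ```

    Both routes should agree: the closed form is exact but requires solving an \(n\times n\) system, while GD only needs matrix–vector products per step.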

Logistic Regression & Softmax Regression

  1. What is the function of Logistic Regression?

    Logistic Regression is a binary classification model. It has a linear decision boundary (hyperplane), but with a nonlinear activation function (Sigmoid function) to model the posterior probability.

  2. How to describe a logistic regression model?

    Since it is a binary classification model, which only considers \(y=0\) or \(y=1\), the hypothesis is written as:

    \[p(y=1|x;\theta)=h_{\theta}(x)=\delta(\theta^Tx)=\frac{1}{1+e^{-\theta^Tx}} \]

    \[p(y=0|x;\theta)=1-h_{\theta}(x) \]

    where \(\delta(\cdot)\) denotes the sigmoid function. A compact form of the hypothesis is:

    \[p(y|x;\theta)=(h_{\theta}(x))^y(1-h_{\theta}(x))^{1-y} \]

    The hypothesis models the posterior probability, so we maximize the likelihood of the observed labels. The learning criterion is Maximum Likelihood Estimation, where the likelihood function is

    \[L(\theta)=\prod_{i=1}^Np(y^{(i)}|x^{(i)};\theta)=\prod_{i=1}^N(h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}} \]

    In MLE, to simplify the calculation, we take the logarithm to turn the product \(\prod\) into a sum \(\sum\). This leads to an unconstrained optimization problem:

    \[\max \limits_{\theta} \sum_{i=1}^N y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \]

    then we can choose an optimization method, like GD, SGD, or Newton's method.

    💡Gradient Ascent:

    \[\theta:=\theta+\alpha \sum_{i=1}^N(y^{(i)}-h_{\theta}(x^{(i)}))x^{(i)} \]

    💡Stochastic Gradient Descent:

    \[\theta:=\theta+\alpha (y-h_{\theta}(x))x \]

    💡Newton’s Method:

    \[\theta:=\theta-H^{-1}\nabla J(\theta) \]

    \[\nabla J(\theta)=\frac{1}{N}\sum_{i=1}^N(h_{\theta}(x^{(i)})-y^{(i)})x^{(i)} \]

    \[H=\frac{1}{N}\sum_{i=1}^Nh_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))x^{(i)}(x^{(i)})^T \]

    \[J(\theta)=-\frac{1}{N}\log L(\theta) \]

    We want to minimize \(J(\theta)\).

    🎈: Pay attention to the derivative of the sigmoid function:

    \[\frac{\partial}{\partial \theta}h_{\theta}(x^{(i)})=\frac{\partial}{\partial \theta}\delta(\theta^Tx^{(i)})=h_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))\frac{\partial}{\partial \theta}\theta^Tx^{(i)} \]
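
    A short sketch putting the pieces together: the sigmoid, its gradient, and the batch gradient-ascent update on the log-likelihood. The toy data, learning rate, and iteration count are assumptions:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy binary data (assumed): the first column is the bias term.
    X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
    y = np.array([0.0, 0.0, 1.0, 1.0])

    theta = np.zeros(X.shape[1])
    alpha = 0.1                                    # learning rate (assumed)
    for _ in range(1000):
        h = sigmoid(X @ theta)                     # h_theta(x^{(i)}) for every sample
        theta += alpha * X.T @ (y - h)             # gradient ascent: sum_i (y_i - h_i) x_i

    print(sigmoid(X @ theta))                      # predicted probabilities approach the labels
    ```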

    🎃: The key idea of Newton's Method is to approximate the function by its second-order Taylor expansion. To solve the minimization problem, we want to find the solution of \(\nabla J(\theta)=0\). Since

    \[f(x)=f(x_k)+f^{\prime}(x_k)(x-x_k)+\frac{1}{2}f^{\prime \prime}(x_k)(x-x_k)^2+...\approx f(x_k)+f^{\prime}(x_k)(x-x_k) \]

    then

    \[f(x)=0\Rightarrow f(x_k)+f^{\prime}(x_k)(x-x_k)=0\Rightarrow x_{k+1}=x_k-\frac{f(x_k)}{f^{\prime}(x_k)} \]

    Applying this with \(f=\nabla J\), whose derivative is the Hessian \(H\), gives

    \[\theta:=\theta-H^{-1}\nabla J(\theta) \]
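
    A minimal sketch of the Newton update \(\theta:=\theta-H^{-1}\nabla J(\theta)\) for logistic regression, using the gradient and Hessian expressions above; the (non-separable) toy data and the number of iterations are assumptions:

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Toy non-separable data (assumed), so the MLE is finite and H stays invertible.
    X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
    N = X.shape[0]

    theta = np.zeros(X.shape[1])
    for _ in range(10):                            # Newton usually needs only a few steps
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / N                   # (1/N) sum_i (h_i - y_i) x_i
        H = (X * (h * (1 - h))[:, None]).T @ X / N # (1/N) sum_i h_i (1 - h_i) x_i x_i^T
        theta -= np.linalg.solve(H, grad)          # theta := theta - H^{-1} grad
    ```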

  3. Why is the cost function of Logistic Regression the cross-entropy?

    The cross entropy of a distribution \(q\) relative to a distribution \(p\) over a given set is defined as \(H(p,q)=-E_{p}[\log q]=-\sum_i p(i)\log q(i)\). It measures the distance between the two probability distributions: the smaller the cross entropy, the more similar the distributions. In logistic regression, the true distribution \(p\) is given by the true label and the distribution \(q\) is the predicted probability of the current model, so the cost function can be written as

    \[\min \limits_{\theta} -\sum_{i=1}^N [y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)}))] \]

    which measures the distance between the model's predicted distribution and the true label distribution.
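
    As a quick numerical illustration (the probabilities below are made up), the per-sample binary cross-entropy rewards confident correct predictions and heavily penalizes confident wrong ones:

    ```python
    import numpy as np

    def binary_cross_entropy(y, p):
        # H(p_true, q_model) for one sample with true label y and predicted P(y=1) = p
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    print(binary_cross_entropy(1, 0.9))   # ~0.105: confident and correct -> small cost
    print(binary_cross_entropy(1, 0.1))   # ~2.303: confident but wrong   -> large cost
    ```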

  4. How to describe a Softmax regression model?

    Softmax Regression is also known as the Maximum Entropy Model (in NLP) or multi-class logistic regression. The hypothesis is written as

    \[p(y=j|x;\theta)=h_j(x)=\frac{e^{\theta_{j}^T x}}{\sum_{j^{\prime}=1}^C e^{\theta_{j^{\prime}}^T x}}, j=1,2,...,C \]

    Then we write the log-likelihood \(L(\theta)\), using the logarithm to turn the product \(\prod\) into a sum \(\sum\):

    \[L(\theta)=\sum_{i=1}^N \log p(y^{(i)}|x^{(i)};\theta)=\sum_{i=1}^N\log \prod_{j=1}^C h_{j}(x^{(i)})^{1\{y^{(i)}=j\}}=\sum_{i=1}^N \sum_{j=1}^C 1\{y^{(i)}=j\} \log h_j(x^{(i)}) \]

    To maximize \(L(\theta)\), use the batch gradient-ascent update for each class weight \(\theta_k\):

    \[\theta_k:= \theta_k+\alpha \sum_{i=1}^N(1\{y^{(i)}=k\}-h_k(x^{(i)}))x^{(i)} \]

    or the stochastic (per-sample) update:

    \[\theta_k:=\theta_k+\alpha(1\{y=k\}-h_k(x))x \]

    🌈: Because the outputs of softmax sum to 1 and the value for each class can be interpreted as a confidence, for the true class \(k\) we want the output \(h_k(x)\) to be as close to 1 as possible. We therefore treat \((1-h_{k}(x))\) as the error for class \(k\), and \(-h_j(x)\) as the error for every other class \(j\neq k\).
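
    A compact NumPy sketch of softmax regression trained with the per-class update above (written for all classes at once with one-hot indicators); the toy three-class data, learning rate, and iteration count are assumptions:

    ```python
    import numpy as np

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)       # subtract the row max for numerical stability
        E = np.exp(Z)
        return E / E.sum(axis=1, keepdims=True)

    # Toy data (assumed): m samples, n features (bias first), C classes.
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0, 0, 1, 2])
    C = 3
    Y = np.eye(C)[y]                               # one-hot rows: 1{y^{(i)} = j}

    Theta = np.zeros((X.shape[1], C))              # one parameter vector theta_j per class
    alpha = 0.1                                    # learning rate (assumed)
    for _ in range(2000):
        H = softmax(X @ Theta)                     # h_j(x^{(i)}) for every i, j
        Theta += alpha * X.T @ (Y - H)             # theta_j += alpha * sum_i (1{y=j} - h_j) x

    print(np.argmax(softmax(X @ Theta), axis=1))   # predicted class for each training sample
    ```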

Perceptron

  1. How to describe the classic Perceptron model?

    The classic Perceptron model is used for binary classification. It requires the dataset to be linearly separable, in which case it can find a hyperplane \(S\) that completely separates the two classes. The hypothesis is:

    \[h_w(x)=\begin{cases} 1 & \text{if } w^Tx \ge 0 \\ 0 & \text{if } w^Tx < 0 \end{cases} \]

    Then we use cost-function optimization as the learning criterion. The cost function of the Perceptron has a slightly different form from the models above, but it still follows the “Error × Feature” rule. We want the number of misclassified samples to decrease, so

    \[J_p(w)=\sum_{i=1}^N(h_w(x^{(i)})-y^{(i)})w^Tx^{(i)} \]

    If a sample is classified correctly, its cost is zero; if it is misclassified, its cost is the weight vector dotted with the sample's features. The total cost is therefore the sum of this term over all misclassified samples, which can be regarded as (proportional to) the total distance between the misclassified samples and the hyperplane. The learning rule of the Perceptron is also special, but it can be understood as an SGD update:

    \[w:=w+\alpha(y-h_w(x))x \]

    P.S.: If \(y=1\) and \(h_w(x)=0\), we should increase the weight, so \(w:=w+\alpha x\); if \(y=0\) and \(h_w(x)=1\), we should decrease the weight, so \(w:=w-\alpha x\); for correctly classified samples, the weights are not changed.

    👀: The “Error × Feature” rule means that the update is the prediction error multiplied by the input features, so a penalty is applied only to wrong predictions.
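
    A small sketch of the classic Perceptron with 0/1 labels and the update rule above; the toy linearly separable data, learning rate, and epoch count are assumptions:

    ```python
    import numpy as np

    def predict(w, x):
        return 1 if w @ x >= 0 else 0              # h_w(x)

    # Toy linearly separable data (assumed); the first component is the bias input.
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 5.0]])
    y = np.array([0, 0, 1, 1])

    w = np.zeros(X.shape[1])
    alpha = 1.0                                    # learning rate (assumed)
    for _ in range(100):                           # epochs over the training set
        for xi, yi in zip(X, y):
            w += alpha * (yi - predict(w, xi)) * xi   # nonzero only for misclassified samples

    print([predict(w, xi) for xi in X])            # [0, 0, 1, 1] once converged
    ```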

  2. How to describe the multi-class Perceptron model?

    The Multi-class Perceptron is an extension of the standard Perceptron to multi-class classification problems. The hypothesis is

    \[C^{\star}=\arg \max \limits_{j=1,...,C}w_j^Tx \]

    Cost function is

    \[J_p(w)=\sum_{k=1}^N(\max\limits_{j=1,...,C}w_j^Tx^{(k)}-w_{y^{(k)}}^Tx^{(k)}) \]

    The parameter update rule, where \(c^{(k)}=\arg\max\limits_{j}w_j^Tx^{(k)}\) is the predicted class for sample \(k\), is

    \[w_j:=w_j-\alpha(1\{j=c^{(k)}\}-1\{j=y^{(k)}\})x^{(k)} \]
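
    A sketch of the multi-class Perceptron: predict with \(\arg\max_j w_j^Tx\), then decrease the weight of the wrongly predicted class \(c^{(k)}\) and increase the weight of the true class \(y^{(k)}\). The toy data and epoch count are assumptions:

    ```python
    import numpy as np

    # Toy data (assumed): m samples, n features (bias first), C classes.
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0], [1.0, 6.0], [1.0, 7.0]])
    y = np.array([0, 0, 1, 1, 2, 2])
    C = 3

    W = np.zeros((C, X.shape[1]))                  # one weight vector w_j per class
    alpha = 1.0                                    # learning rate (assumed)
    for _ in range(200):                           # epochs over the training set
        for xk, yk in zip(X, y):
            c = int(np.argmax(W @ xk))             # predicted class c^{(k)}
            if c != yk:                            # update only on mistakes
                W[c]  -= alpha * xk                # w_c := w_c - alpha * x^{(k)}
                W[yk] += alpha * xk                # w_y := w_y + alpha * x^{(k)}

    print([int(np.argmax(W @ xk)) for xk in X])    # predicted class for each training sample
    ```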
