- What is Machine Learning?
Terms:
L: loss, an expectation of the discrepancy between \(y_{infer}\) and \(y_{true}\).
$L_D$: loss over the distribution, $L_D \overset {def}= E[l]$, where l is the loss on a single example.
For regression, the mean squared error is $ L_D \overset {def}= E[(y_{infer} - y_{true})^2] $.
For classification, the cross-entropy loss is $ L_D \overset {def}= E[-\sum_{i=1}^N { P_i \log_2 \hat {P_i}}]$, where N is the number of categories of y, \(P_i\) is the true probability that y is category i, and \(\hat {P_i}\) is the inferred probability that y is category i.
\(L_S\): loss over the sample, $L_S \overset {def}= \overline {l}$, the mean of the example losses over S.
A(S): the result of running learning algorithm A on sample S, which returns an inferring function h, $y_{infer} = h(X)$. (A small numeric sketch of these loss definitions follows.)
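A minimal numeric sketch of the sample-loss definitions above (assuming NumPy; the arrays are made-up illustrative values):

```python
import numpy as np

# Sample loss L_S for regression: mean squared error over the sample.
y_true = np.array([1.0, 2.0, 3.0])
y_infer = np.array([1.1, 1.9, 3.2])
L_S_mse = np.mean((y_infer - y_true) ** 2)

# Sample loss L_S for classification: cross-entropy averaged over examples.
# Rows are examples, columns are categories; P is the true (here one-hot)
# distribution, P_hat is the inferred distribution.
P = np.array([[1.0, 0.0], [0.0, 1.0]])
P_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
L_S_xent = np.mean(-np.sum(P * np.log2(P_hat + 1e-12), axis=1))

print(L_S_mse, L_S_xent)
```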
PAC learnable: A model H is PAC learnable if there exists a function \(m_H(\epsilon, \delta)\) and a learning algorithm A with the following property:
for every \(\epsilon\), \(\delta \in (0, 1)\), for every distribution D over X, and for every labeling function y = f(x) (in this case y|X is fixed), if \(f \in H\), then when running learning algorithm A on \(m \geq m_H(\epsilon, \delta)\) i.i.d. (independent and identically distributed) examples generated by D and labeled by f, the learning algorithm returns an inferring function h (over the choice of the examples) which satisfies:
\(P[L_D(h) \leq \epsilon] \geq (1 - \delta)\)
agnostic PAC learnable: A model H is agnostic PAC learnable if there exists a function \(m_H(\epsilon, \delta)\) and a learning algorithm A with the following property:
for every \(\epsilon\), \(\delta \in (0, 1)\), for every distribution D over (X, y) (in this case y|X is variable), when running learning algorithm A on \(m \geq m_H(\epsilon, \delta)\) i.i.d. (independent and identically distributed) examples generated by D, the learning algorithm returns an inferring function h (over the choice of the examples) which satisfies:
\(P[L_D(h) \leq \mathop {min} \limits_{h^* \in H} L_D(h^*) + \epsilon] \geq (1 - \delta)\)
Note:
Why "for every distribution"? Because the distribution of a machine learning task is not known, we can only make statements that hold for every distribution, which then also hold for the particular (unknown) distribution of the task at hand.
Hoeffding Inequality: for i.i.d. random variables \(Z_1, ..., Z_m\) with \(E[Z_i] = \mu\) and \(P[a \leq Z_i \leq b] = 1\), for every \(\epsilon > 0\): \(P[|{1 \over m}\sum_{i=1}^m Z_i - \mu| > \epsilon] \leq 2\exp({-2m\epsilon^2 \over (b-a)^2})\) (stated again in more detail later).
Machine Learning = statistical learning. The goal of Machine Learning is to get a small enough \(L_D\). Since the distribution is unknown, all that can be seen is a sample S generated by the distribution, so we can only estimate \(L_D\) via the sample.
A(S) is a random variable over the choice of sample S, so "get a small enough \(L_D\)" means that:
\(P[L_D(A(S)) \leq \epsilon] \geq (1 - \delta)\), in the case \(y|X\) is fixed,
$P[L_D(A(S)) \leq \mathop {min} L_D + \epsilon] \geq (1 - \delta) $, in the case \(y|X\) is variable,
where \(\epsilon\) and \(\delta\) are small numbers in (0, 1).
\(L_D\) is minimal when $ y_{infer} = E[y|X] $ (for the squared loss), because:
X is given, namely both \(y_{infer}\) and \(y_{true}\) are considered conditional on the known X,
and, with X known, \(y_{infer}\) is a fixed number c, so the conditional expected loss splits into a variance term plus \((E[y|X] - c)^2\), which is minimized exactly at \(c = E[y|X]\) (see the short derivation below).
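The standard decomposition behind this claim for the squared loss (written out here to fill in the step): with X fixed, write \(c = y_{infer}\); the cross term vanishes because \(E[y - E[y|X] \mid X] = 0\), so
$ E[(y - c)^2 | X] = E[(y - E[y|X])^2 | X] + (E[y|X] - c)^2 = Var(y|X) + (E[y|X] - c)^2 $,
which is minimized exactly at \(c = E[y|X]\).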
Neural networks and decision forests are universal approximators, namely they can approximate any function to any precision with enough units. Here a unit is a node of the neural network or of the decision forest.
So, for any machine learning task, we can use a neural network to approximate \(y_{infer} = E[y_{true}|X]\) to any precision.
model H has finite VC dimension \(\equiv\) model H is agnostic PAC learnable \(\equiv\) \(A(S) = \mathop {argmin} \limits_{h \in H} L_S(h)\) is an agnostic PAC learner for H.
For a neural network, finite parameters \(\equiv\) finite VC dimension, so a neural network with finitely many parameters is agnostic PAC learnable, and \(A(S) = \mathop {argmin} \limits_{h \in H} L_S(h)\) is an agnostic PAC learner for it.
Now, assume we use a neural network model H with parameters W, and the learning algorithm is \(A(S) = \mathop {argmin} \limits_{h \in H} L_S(h)\).
Fix H, \(\epsilon\), \(\delta\). If we have enough training examples, namely \(m \geq m_H(\epsilon, \delta)\), then according to the definition of agnostic PAC learnable,
\(P[L_D(A(S)) \leq \mathop {min} \limits_{h^* \in H} L_D(h^*) + \epsilon] \geq (1 - \delta)\) .
How can we know if \(m \geq m_H(\epsilon, \delta)\) ?
According to the Hoeffding inequality, \(|L_D(A(S)) - L_V(A(S))|\) (where \(L_V\) is the loss on a validation sample of size \(m_V\)) can be bounded very tightly with enough validation examples (e.g. 300), assuming the loss values lie in [a, b]:
$ P[|L_D(A(S)) - L_V(A(S))| \leq \epsilon] \geq 1 - 2\exp({-2 m_V \epsilon^2 \over (b-a)^2}) $,
If \(m_V = 300\) and \({\epsilon \over (b-a)} = 0.1\), then:
\(P[|L_D(A(S)) - L_V(A(S))| \leq \epsilon] \geq 0.995\)
So we can use \(L_V(A(S))\) as estimate of \(L_D(A(S))\).
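A quick numeric check of this bound (a sketch; the numbers are the ones used above):

```python
import math

m_V = 300        # number of validation examples
eps_ratio = 0.1  # epsilon / (b - a)

failure_prob = 2 * math.exp(-2 * m_V * eps_ratio ** 2)
print(1 - failure_prob)  # ~0.995, the bound quoted above
```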
We can draw curves of validation loss and training loss versus the number of training examples. As the number of training examples increases, the training loss increases, since more examples are harder to fit all at once, while the validation loss decreases, since a larger training sample approximates the distribution better. And when $ m \geq m_H(\epsilon, \delta) $, both validation loss and training loss converge to $ \mathop {min} \limits_{h^* \in H} L_D(h^*) $.
Note:
when drawing the curves of loss versus number of training examples, the training sets should be nested, namely each new, larger training set is formed by adding new examples to the previous training set (a plotting sketch follows below).
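One way such curves could be drawn (a sketch assuming scikit-learn and matplotlib; the synthetic data, the MLPRegressor settings, and the subset sizes are illustrative placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)

X_train, y_train = X[:1500], y[:1500]
X_val, y_val = X[1500:], y[1500:]

sizes = [100, 300, 600, 1000, 1500]
train_losses, val_losses = [], []
for m in sizes:
    # Nested subsets: each larger training set contains the previous one.
    model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    model.fit(X_train[:m], y_train[:m])
    train_losses.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
    val_losses.append(mean_squared_error(y_val, model.predict(X_val)))

plt.plot(sizes, train_losses, label="training loss")
plt.plot(sizes, val_losses, label="validation loss")
plt.xlabel("number of training examples")
plt.ylabel("loss")
plt.legend()
plt.show()
```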
How can we know whether \(\mathop {min} \limits_{h^* \in H} L_D(h^*)\) approximates \(\mathop {min} L_D\) ?
As the capacity of the neural network model H increases, \(\mathop {min} \limits_{h^* \in H} L_D(h^*)\) decreases, and eventually converges to \(\mathop {min} L_D\) .
Now, if we have enough training examples, the goal of machine learning:
$P[L_D(A(S)) \leq \mathop {min} L_D + \epsilon] \geq (1 - \delta) $
is achieved.
How many is \(m_H(\epsilon, \delta)\) ?
H is agnostic PAC learnable with sample complexity:
$C_1 {{d + ln(1 / \delta)} \over {\epsilon^2}} \leq m_H(\epsilon, \delta) \leq C_2 {{d + ln(1 / \delta)} \over {\epsilon^2}} $
where d is the VC dimension of H, and \(C_1\) and \(C_2\) are constants independent of H.
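Since \(C_1\) and \(C_2\) are unknown, only the \({{d + ln(1/\delta)} \over {\epsilon^2}}\) factor can be computed; a small sketch with hypothetical values of d, \(\epsilon\), \(\delta\):

```python
import math

d, eps, delta = 100, 0.1, 0.1  # hypothetical VC dimension, accuracy and confidence
factor = (d + math.log(1 / delta)) / eps ** 2
print(factor)  # ~10230.3; m_H(eps, delta) lies between C_1 and C_2 times this factor
```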
Attention:
Although, in the book, the fundamental theorem of statistical learning is stated for binary classification with the 0-1 loss, based on my experience it is reasonable to generalize it to y being multi-category or numeric.
If we do not have enough training examples, how can we try to achieve the goal of machine learning:
$P[L_D(A(S)) \leq \mathop {min} L_D + \epsilon] \geq (1 - \delta) $ ?
We assume that the capacity of H is enough to get a small \(L_S\).
Now the question is: how can we try to make $ L_D(A(S)) - L_S(A(S)) $ small when $ m \geq m_H(\epsilon, \delta) $ does not hold with respect to the specified H ?
In this case \(L_S(A(S))\) is small but $ L_D(A(S)) - L_S(A(S)) $ is not small; this is called overfitting.
The unrestricted minimum of \(L_D(h)\) is attained when $ y_{infer} = h(X) = E[y|X] $. With the restriction $ h \in H $, the closer $ y_{infer} = h(X) $ is to \(E[y|X]\), the smaller \(L_D(h)\) is; and the closer $ y_{infer} = h(X) $ is to $ \underset {S} {mean}[y|X] $ (the conditional mean over the sample), the smaller $ L_S(h) $ is.
\(A(S) = \mathop {argmin} \limits_{h \in H} L_S(h)\)
So A(S) returns the h, among all \(h \in H\), for which $ y_{infer} = h(X) $ best approximates \(\underset {S} {mean}[y|X]\).
Although the sample is generated by the distribution, it contains only finitely many examples: the main trend of the sample is consistent with the distribution, but in some tiny local regions there is a small difference between $ \underset {S} {mean}[y|X] $ and $ E[y|X] $; the larger the sample size, the smaller this difference. This is the cause of overfitting.
Methods to relieve overfit for neural network:
Early Stopping:
Gradient descent updates W in the direction of the negative gradient, which is the direction in which the objective function decreases fastest, namely, with $ ||\Delta W ||_2 $ fixed, the objective function decreases most. The optimization learns the main trend of the sample first, because that brings the largest decrease of \(L_S(h)\), and only afterwards learns tiny local details specific to the sample. Learning those sample-specific details is not wanted, since it leads to overfitting, so stopping the optimization early, when it starts to learn them, relieves overfitting. How can we know when it starts to learn sample-specific details? When the validation loss does not improve any more. A minimal sketch follows.
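A minimal early-stopping loop (a sketch; `train_one_epoch`, `validation_loss`, `get_state`, and `set_state` are hypothetical helpers standing in for a concrete training setup):

```python
def train_with_early_stopping(model, patience=5, max_epochs=1000):
    """Stop optimizing when the validation loss stops improving.

    The helpers called here are hypothetical placeholders for a real setup.
    """
    best_val = float("inf")
    best_state = None
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)        # hypothetical: one pass of gradient descent on L_S
        val = validation_loss(model)  # hypothetical: L_V of the current model
        if val < best_val:
            best_val = val
            best_state = model.get_state()  # hypothetical state getter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # the model has started fitting sample-specific detail
    model.set_state(best_state)  # keep the best model seen on the validation set
    return model
```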
Regularization:
If the loss function is \(\beta\)-smooth and nonnegative, then, the RLM (regularized loss minimization) rule with the regularizer \(\lambda||w||^2\), where \(\lambda \geq {2\beta \over m}\), satisfies:
\(\underset {S \sim D^m} E[L_D(A(S)) - L_S(A(S))] \leq {48 \beta \over {\lambda m}} E[L_S(A(S))]\)
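A sketch of the RLM rule for a linear model with squared loss, where the minimizer of \(L_S(w) + \lambda ||w||^2\) has a closed form (NumPy; the data are made up):

```python
import numpy as np

def rlm_fit(X, y, lam):
    """Regularized loss minimization: argmin_w  L_S(w) + lam * ||w||^2,
    where L_S(w) is the mean squared error of the linear model Xw on the sample.
    Setting the gradient to zero gives (X^T X / m + lam * I) w = X^T y / m."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)
w = rlm_fit(X, y, lam=0.1)
print(w)
```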
Dropout:
Set a fraction (e.g. 0.5) of the nodes in a layer to zero randomly at each update during training; when inferring, use all nodes in the layer and multiply by 0.5, namely activation(0.5*(WX + b)). This way, the updates of W depend less on the co-occurrence of particular features, i.e. each feature's effect on W is more independent of the other features. Since each update uses a random half of the nodes, one learned sub-model with half of the nodes overfits to one tiny local region of the sample, another sub-model with a different half overfits to another tiny local region, and the average of these overfitted sub-models should approximate the expectation of the distribution, since the sample is generated by the distribution. A small sketch follows the note below.
Note:
Here, features are the nodes in a layer; the middle layers of a neural network can be considered as learning features, and the last middle layer can be considered as the last learned features.
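A sketch of one layer with dropout as described above (NumPy; the layer sizes and keep fraction are illustrative; here the dropout mask and the inference-time scaling are applied to the layer's output, a minor variation of the pre-activation scaling written above):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_layer(x, W, b, keep_prob=0.5, training=True):
    """One layer h = relu(Wx + b) with dropout on its nodes.
    Training: a random fraction (1 - keep_prob) of the nodes is set to zero.
    Inference: all nodes are used and the output is scaled by keep_prob."""
    h = np.maximum(0.0, W @ x + b)
    if training:
        mask = rng.random(h.shape) < keep_prob
        return h * mask          # drop a random half of the nodes at each update
    return keep_prob * h         # use all nodes, times 0.5, when inferring

W = rng.normal(size=(4, 3))
b = np.zeros(4)
x = rng.normal(size=3)
print(dropout_layer(x, W, b, training=True))
print(dropout_layer(x, W, b, training=False))
```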
Reduce capacity of H:
For a neural network, smaller capacity \(\equiv\) fewer parameters \(\equiv\) smaller VC dimension \(\equiv\) smaller \(m_H(\epsilon, \delta)\).
When using methods to relieve overfitting, \(L_D(A(S)) - L_S(A(S))\) decreases, while \(L_S(A(S))\) may increase. If, after tuning the hyperparameters of these methods (e.g. \(\lambda\) of \(l_2\) regularization, the number of parameters) to get as good an \(L_D(A(S))\) as possible, \(L_D(A(S))\) is still not good enough, this means that with this sample size we can get at best this \(L_D(A(S))\); to get a better one, we need more examples.
Note:
For a neural network, assume there are m training examples. If y is numeric, it is well known that m data points can determine at most m parameters; if y is categorical, m points can constrain at most m parameters. If there are more parameters than m, say m+1, the m training examples put no restriction on the (m+1)-th parameter: it can take any value without affecting the fit of the training examples. But fitting the distribution does constrain that parameter, so the number of parameters should not exceed the number of training examples, or the learned model cannot fit the distribution well, since the (m+1)-th parameter is effectively set arbitrarily. A tiny illustration follows.
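A tiny numeric illustration of this point (a sketch with a linear model: m = 3 examples, 4 parameters; any null-space direction can be added to w without changing the fit on the training examples):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))  # m = 3 training examples, 4 parameters
y = rng.normal(size=3)

# One parameter vector that fits all 3 examples exactly (the system is underdetermined).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# A direction the training data puts no restriction on: the null space of X.
_, _, Vt = np.linalg.svd(X)
n = Vt[-1]

print(np.allclose(X @ w, y))              # True: w fits the training examples
print(np.allclose(X @ (w + 5.0 * n), y))  # also True: the extra freedom is unconstrained
```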
So, we can specify a model H that has finite VC dimension, and ideally \(A(S) = \mathop {argmin} \limits_{h \in H} L_D(h)\). As already mentioned, the distribution D is unknown, so we cannot minimize \(L_D(h)\) w.r.t. \(h \in H\) directly, but we can use \(L_S(A(S))\) to estimate \(L_D(A(S))\) and set \(A(S) = \mathop {argmin} \limits_{h \in H} L_S(h)\).
So how do we minimize \(L_S(h)\) w.r.t. \(h \in H\) ?
Neural networks and decision forests are universal approximators; a universal approximator is a family of functions that can approximate any function to any precision with enough units. We can specify the model H to be a neural network, then minimize \(L_S(h)\) w.r.t. \(h \in H\) with gradient descent.
Now we can minimize \(L_S(h)\) w.r.t. \(h \in H\), but \(\mathop {min} \limits_{h \in H} L_S(h)\) is only an estimate of \(\mathop {min} \limits_{h \in H} L_D(h)\); how can we know how good the estimate is? \(L_D(A(S)) = L_D(A(S)) - L_S(A(S)) + L_S(A(S)) = L_D(A(S)) - L_V(A(S)) + L_V(A(S)) - L_S(A(S)) + L_S(A(S))\), and \(L_D(A(S)) - L_V(A(S))\) can be bounded very tightly by the Hoeffding inequality. So if \(L_V(A(S)) - L_S(A(S))\) is small, we can say the estimate is good, and both the validation loss and the training loss can be observed. This only verifies whether the estimate is good; so how can we try to obtain a good estimate when computing A(S) ?
So, how many is $ m_H(\epsilon, \delta) $ ?
H is agnostic PAC learnable with sample complexity:
$ C_1 {{d + ln(1 / \delta)} \over {\epsilon^2}} \leq m_H(\epsilon, \delta) \leq C_2 {{d + ln(1 / \delta)} \over {\epsilon^2}} $
where d is the VC dimension of H, and \(C_1\) and \(C_2\) are constants independent of H.
Fix H, \(\epsilon\) (e.g. 0.1), \(\delta\) (e.g. 0.1). If we have enough training examples, namely $ m \geq m_H(\epsilon, \delta) $, then according to the definition of agnostic PAC learnable,
$ P[L_D(A(S)) \leq \mathop {min} \limits_{h^* \in H} L_D(h^*) + \epsilon ] \geq (1 - \delta) $
where $ A(S) = \mathop {argmin} \limits_{h \in H} \ L_S(h) $ .
H has a finite VC dimension $\equiv $ "$ A(S) = \mathop {argmin} \limits_{h \in H} \ L_S(h) $" is an agnostic PAC learner for H.
With \(C_1\) and \(C_2\) unknown, how can we know if $ m \geq m_H(\epsilon, \delta) $ ?
How can we know whether $ \mathop {min} \limits_{h^* \in H} L_D(h^*) $ approximates the minimal \(L_D\) ?
The minimal \(L_D(h)\) without restriction on h is attained when $ y_{infer} = h(X) = E[y|X] $. As the capacity of H increases, $ \mathop {min} \limits_{h^* \in H} L_D(h^*) $ decreases, and eventually converges to the minimal \(L_D\).
If we do not have enough training examples, how can we try to get a good estimate of $ \mathop {min} \limits_{h \in H} L_D(h) $ ?
1. What is Machine Learning?
Machine Learning = statistical learning. Namely, there is an unknown distribution over (x1, x2, ..., xm, y), and the task is to infer y when (x1, x2, ..., xm) is known. Although the distribution is unknown, so the relation between (x1, x2, ..., xm) and y is unknown, we can take a sample from the distribution and obtain the relation between (x1, x2, ..., xm) and y on the sample, since every (x1, x2, ..., xm, y) in the sample is known; we then use the relation on the sample to estimate the relation under the distribution, based on statistical rules.
2. So with (X, y) known in the sample, how to get f:X->y of the sample?
Note: X = vector {x1, x2, ..., xm}
The answer is a universal approximator, namely a function with parameters which can fit any function to any precision with a combination of enough units.
Some universal approximators are:
neural network: its unit is one node in one layer, whose mathematical form is
activation(wx + b), where activation is a non-linear function such as relu, sigmoid, etc. The purpose of the activation is to introduce non-linearity into the neural network. A neural network with one layer and enough units can also fit any function, but compared to a neural network with multiple layers it needs more units in practice; more units = more parameters, and more parameters = larger estimation error, as will be illustrated later. So usually a neural network with multiple layers is used; this is also called deep learning. (A small sketch of such a unit follows this list.)
decision forest: its unit is one node in one decision tree. Though one decision tree with enough nodes can also fit any function, using multiple decision trees (namely a decision forest) is better at reducing estimation error in practice.
polynomial: its unit is \(x^k\).
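A small sketch of the neural-network unit activation(wx + b) used as a building block: one hidden layer of relu units approximating a 1-D function, with the hidden weights fixed at random and only the output weights fitted by least squares (an illustrative simplification, not how networks are normally trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target function to approximate, sampled at 200 points.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(2 * x[:, 0])

# Hidden units of the form relu(w * x + b), with w, b fixed at random;
# only the output-layer weights c are fitted, by least squares.
for n_units in (5, 50, 500):
    W = rng.normal(size=(1, n_units))
    b = rng.uniform(-3, 3, size=n_units)
    H = np.maximum(0.0, x @ W + b)             # hidden activations, shape (200, n_units)
    c, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights
    print(n_units, np.mean((H @ c - y) ** 2))  # fit error shrinks as units are added
```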
Given a universal approximator with parameters (this is also called a model), compute the parameters by fitting it to the sample: assuming the model is $ y_{infer} = f(x; w) $, where $ w $ is the parameter, $ w = \underset{w} {argmin} |y_{true} - y_{infer}| $. That is, to get the parameter we just need to minimize $ |y_{true} - y_{infer}| $; this is also called fitting.
Minimization is also called optimization in mathematics. A smooth function, i.e. one that has a gradient everywhere, is easier to deal with. $ f(y_{infer}) = |y_{true} - y_{infer}| $ is not a smooth function, while $ f(y_{infer}) = (y_{true} - y_{infer})^2 $ is, and they attain their minimum at the same $ y_{infer} $; since $ y_{infer} = f(x; w) $, also at the same $ w $. So usually the objective function to optimize is $ f(w) = (y_{true} - y_{infer})^2 $. Note, here y is assumed numeric, not categorical; the objective for categorical y will be illustrated later. The objective function to optimize is also called the loss function in Machine Learning.
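A minimal gradient-descent sketch for the squared-loss objective, using a linear model \(y_{infer} = Xw\) as a stand-in (the data and learning rate are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y_true = X @ w_true + 0.05 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for step in range(500):
    y_infer = X @ w
    grad = 2 * X.T @ (y_infer - y_true) / len(y_true)  # gradient of the mean squared error
    w -= lr * grad
print(w)  # close to w_true
```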
3. with $ f: X \rightarrow y $ known for the sample, how to estimate the $ f: X \rightarrow y $ for the distribution?
3.1 PAC (Probably Approximately Correct) learnable
A model H is defined as PAC learnable if there exists a function $ m_H: (0, 1)^2 \rightarrow N $ and a learning algorithm with the following property:
for every $ \epsilon, \delta \in (0, 1) $, for every distribution D over (X, y), when running the learning algorithm on \(m \geq m_H(\epsilon, \delta)\) i.i.d. examples generated by D, the algorithm returns an $ h: X \rightarrow y $ such that, with probability of at least $ 1 - \delta $ (over the choice of the m training examples), $ L_{D}(h) \leq \mathop {min} \limits_{h^* \in H} L_{D}(h^*) + \epsilon $ .
Note:
$ L_D $ : loss of the distribution.
3.2 neural network and decision forest are PAC learnable.
3.2.1 Finite VC dimension $\Leftrightarrow $ PAC learnable in binary classification.
Definition of VC dimension: the VC dimension of a model H is the maximum size of a set $ C \subset X $ that can be shattered by H.
Definition of shattering: a model H shatters a finite set $ C \subset X $ if the restriction of H to C is the set of all functions from C to {0, 1}.
Definition of the restriction of H to C: the restriction of H to C is the set of functions from C to {0, 1} that can be derived from H, namely,
$ H_C = \{(h(c_1), h(c_2), ..., h(c_m)) : h \in H\} $.
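A standard worked example (added here for illustration): let X = \(\mathbb{R}\) and let H be the threshold functions \(h_\theta(x) = 1[x \geq \theta]\). Any single point \(\{c\}\) is shattered: \(\theta \leq c\) labels it 1 and \(\theta > c\) labels it 0, so the restriction of H to \(\{c\}\) contains both functions to {0, 1}. No two points \(c_1 < c_2\) can be shattered, because the labeling (1, 0) is impossible: if \(h_\theta(c_1) = 1\) then \(\theta \leq c_1 < c_2\), so \(h_\theta(c_2) = 1\) as well. Hence VCdim(H) = 1.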
3.2.2 the fundamental theorem of statistical learning:
Let H be a model from a domain X to {0,1}, and let the loss function be the 0-1 loss. Assume that VCdim(H) = d < \(\infty\). Then there are absolute constants \(C_1\), \(C_2\) such that:
H is agnostic PAC learnable with sample complexity:
$ C_1 {{d + ln(1/\delta)} \over {\epsilon^2}} \leq m_H(\epsilon, \delta) \leq C_2 {{d + ln(1/\delta)} \over {\epsilon^2}} $
Note: though the above theorem is stated in the book for binary classification, based on my experience it is reasonable to generalize it to y being multi-category or numeric.
3.2.3 From the fundamental theorem of statistical learning, it can be seen that the sample complexity increases as the VC dimension increases. And for neural networks and decision forests, the VC dimension increases as the number of parameters increases.
For a neural network, assume there are m training examples. If y is numeric, it is known that m points can determine at most m parameters; if y is categorical, m points can constrain at most m parameters. So if there are more parameters than m, say m+1, the m training examples put no restriction on the (m+1)-th parameter: it can take any value without affecting the fit of the training examples. But fitting the distribution does constrain it, so the number of parameters should not be more than the number of examples, or the learned model cannot fit the distribution well, since the (m+1)-th parameter is effectively set arbitrarily.
3.2.4 tradeoff between larger model capacity to reduce approximation error and smaller model capacity to reduce estimation error.
$ L_D(A(S)) = [L_D(A(S)) - L_D(h^*)] + L_D(h^*) = [L_D(A(S)) - L_V(A(S))] + [L_V(A(S)) - L_S(A(S))] + L_S(A(S)) $
$ L_D(h^*) $ is defined as the approximation error, where $ L_D(h^*) = \mathop {min} \limits_{h \in H} L_D(h) $.
$ L_D(A(S)) - L_D(h^*) $ is defined as the estimation error.
A(S) is the model learned by algorithm A from sample S, within model H.
The goal of Machine Learning is to get a small enough $ L_D(A(S)) $, namely both $ L_D(h^*) $ and $ L_D(A(S)) - L_D(h^*) $ need to be small enough.
So when will $ L_D(h^*) $ be small? $ L_D(h^*) $ will be small if H contains a function that can fit sample S well.
Of course, the larger the capacity of H, the higher the probability that it contains a function that fits sample S well. But larger capacity of H = more parameters = larger VC dimension = larger sample complexity = with a fixed sample size and fixed \(\epsilon\), \(\delta\), a larger $ L_D(A(S)) - L_D(h^*) $. So we need a good tradeoff between a small $ L_D(h^*) $ and a small $ L_D(A(S)) - L_D(h^*) $, namely a suitable capacity of H. A good way is to make $ L_D(h^*) $ small enough first; then, if $ L_D(A(S)) - L_D(h^*) $ is large at that point, the capacity of H is too large, so reduce the capacity little by little to find at which capacity both $ L_D(h^*) $ and $ L_D(A(S)) - L_D(h^*) $ can be small enough. If the best tradeoff obtainable is still not good enough, this means the sample size is not sufficient: this sample size can give at best this result, and to get a better one we need more training examples.
In general, as long as the approximation error is greater than zero, we expect the training error to grow with the sample size, as a larger amount of data points makes it harder to provide an explanation for all of them. On the other hand, the validation error tends to decrease as the sample size increases. If the VC dimension is finite, when the sample size goes to infinity, the validation and training errors converge to the approximation error. Therefore, by extrapolating the training and validation curves we can try to guess the value of the approximation error, or at least get a rough estimate of an interval in which the approximation error resides.
$ L_D(A(S)) - L_V(A(S)) $ can be tightly bounded according to Hoeffding's inequality:
Let \(Z_1, ..., Z_m\) be a sequence of i.i.d. random variables. Assume that \(E[Z_i] = \mu\) and \(P[a \leq Z_i \leq b] = 1\) for every i. Then, for any \(\epsilon > 0\):
$ P[|{1 \over m} \sum_{i=1}^m Z_i - \mu| > \epsilon ] \leq 2\exp({-2m\epsilon^2 \over (b - a)^2}) $.
This gives a tighter bound on \(L_D\) than the bound in the definition of agnostic PAC learnable obtained via the sample complexity, which can be calculated from the fundamental theorem of statistical learning.
For example, assume $ loss \in [0, 1] $, VC dimension d = 100, $ \epsilon = 0.1 $, $ \delta = 0.1 $. According to the fundamental theorem of statistical learning:
$ C_1 {{d + ln(1/\delta)} \over {\epsilon^2}} \leq m_H(\epsilon, \delta) \leq C_2 {{d + ln(1/\delta)} \over {\epsilon^2}} $
$ C_1 \times 10,230.3 \leq m_H(\epsilon, \delta) \leq C_2 \times 10,230.3 $
and if $ m \geq m_H(\epsilon, \delta) $, then $ P[L_D(A(S)) \leq L_D(h^*) + \epsilon] \geq 1 - \delta $.
By Hoeffding's inequality,
$ P[|L_V(A(S)) - L_D(A(S))| \leq \epsilon] \geq 1 - 2\exp({-2 m_V \epsilon^2 \over (b-a)^2}) $,
and if $ m_V = 300 $ and $ \epsilon = 0.1 $ (with b - a = 1), then:
$ P[|L_V(A(S)) - L_D(A(S))| \leq \epsilon] \geq 0.995 $
Note:
The case of a large $ L_D(h^*) $ is called underfitting; the case of a large $ L_D(A(S)) - L_D(h^*) $ is called overfitting.
3.2.5 methods to relieve overfit for neural network
3.2.5.1 Regularization
Assume that the loss function is $ \beta $-smooth and nonnegative. Then the RLM rule with the regularizer $ \lambda||w||^2 $, where $ \lambda \geq {2\beta \over m} $, satisfies:
$ \underset {S \sim D^m}{E}[L_D(A(S)) - L_S(A(S))] \leq {48 \beta \over {\lambda m}} E[L_S(A(S))] $
3.2.5.2 Dropout
Set a fraction (e.g. 0.5) of the nodes in a layer to zero randomly at each update during training, and use all nodes in the layer multiplied by 0.5 when inferring. This way, the updates of w depend less on the co-occurrence of particular features, i.e. each feature's effect on w is more independent of the other features. Since each update uses a random half of the nodes, one learned half of the nodes overfits to one tiny set of individual examples in the sample, another half overfits to another tiny set, and the average of these overfitted halves should be near the expectation of the distribution, since the examples are generated from the distribution.
$ E[E[y|x] - y \mid x] = 0 $, namely, if the inferred y is the expectation of y conditional on the known x, then the expected (signed) error is zero.
Here, features are the nodes in a layer; the middle layers of a neural network can be considered as learning features, and the last middle layer can be considered as the last learned features.
3.2.5.3 Early Stopping
Gradient descent optimization updates W in the direction of the negative gradient, since this is the direction in which the objective function decreases fastest: with $ ||\Delta w ||_2 $ fixed, the objective function decreases most in this direction. It fits the main trend of the sample first, since the main trend is shared by most examples, so fitting it decreases the loss (the objective function of Machine Learning) the most, and the main trend is consistent with the distribution, since the examples in the sample are generated from the distribution. Then it fits the tiny details of individual examples, which are specific to the sample and not general to the distribution; fitting these leads to overfitting. So when it starts to fit them, namely when the validation loss does not improve any more, stopping the optimization is good. This is called early stopping.