Deep Learning Flower Book 4

Machine Learning Basics


Learning Algorithm


"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, inproved with experience E"

The Task, T


Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector \(x \in \mathbb{R}^n\), where each entry \(x_i\) of the vector is a feature.

Some of the most common machine learning tasks include the following:

  • Classification: In this type of task, the computer program is asked to specify which of k categories some input belongs to.
  • Classification with missing inputs: To solve the classification task, the learning algorithm must learn a set of functions. Each function corresponds to classifying x with a different subset of its inputs missing. One way to efficiently define such a large set of functions is to learn a probability distribution over all the relevant variables, then solve the classification problem by marginalizing out the missing components. With n input variables, the computer program only needs to learn a single function describing the joint probability distribution (a sketch of this marginalization appears after this list).
  • Regression: In this type of task, the computer program is asked to predict a numerical value given some input.
  • Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form.
  • Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language.
  • Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements.
  • Anomaly detection: In this type of task, the computer program sifts through a set of events or objects and flags some of them as unusual or atypical.
  • Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data.
  • Imputation of missing values: In this type of task, the machine learning algorithm must provide a prediction of the values of the missing entries.
  • Denoising: In this type of task, the computer program is given a corrupted example obtained by an unknown corruption process from a clean example, and must predict the clean example.
  • Density estimation / Probability mass function estimation: The machine learning algorithm is asked to learn a function that can be interpreted as a probability density function (if the examples are continuous) or a probability mass function (if they are discrete) on the space that the examples were drawn from.
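
As a concrete illustration of the marginalization approach mentioned under "classification with missing inputs", here is a minimal sketch, assuming a hypothetical two-feature problem where each class-conditional distribution \(p(x \mid y)\) is Gaussian; all numbers are made up and `classify` is an illustrative helper, not from the text:

```python
# A minimal sketch of classification with missing inputs: marginalizing
# a Gaussian over missing features amounts to dropping the corresponding
# entries of the mean and covariance.
import numpy as np
from scipy.stats import multivariate_normal

# Made-up class-conditional distributions p(x | y) and priors p(y).
means = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 2.0])}
covs = {0: np.eye(2), 1: np.eye(2)}
priors = {0: 0.5, 1: 0.5}

def classify(x, observed):
    """Classify x using only the feature indices listed in `observed`."""
    idx = np.asarray(observed)
    scores = {}
    for y in priors:
        mu = means[y][idx]                  # marginal mean
        sigma = covs[y][np.ix_(idx, idx)]   # marginal covariance
        scores[y] = priors[y] * multivariate_normal.pdf(x[idx], mean=mu, cov=sigma)
    return max(scores, key=scores.get)

# Feature 1 is missing, so we classify using feature 0 alone.
print(classify(np.array([1.8, np.nan]), observed=[0]))   # -> 1
```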

The Performance Measure, P


We usually design a quantitative measure, specific to the task, to evaluate the abilities of the machine learning algorithm.

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will perform when deployed in the real world. We therefore evaluate it on a test set of data that is separate from the data used for training the machine learning system.

In some cases, it is difficult to decide what should be measured. In other cases, measuring the criterion is impractical.

The Experience, E


Machine learning algorithms can be broadly categorized as supervised or unsupervised by what kind of experience they are allowed to have during the learning process.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.

Roughly speaking, unsupervised learning involves observing several examples of a random vector \(x\) and attempting to implicitly or explicitly learn its probability distribution, or some properties of that distribution. Supervised learning involves observing several examples of a random vector \(x\) and an associated value or vector \(y\), provided by an instructor who shows the machine learning system what to do, and learning to predict \(y\) from \(x\).

Linear Regression


We predict the value that \(y\) should take on from the input vector \(x\) as \[\hat{y} = \omega^\top x\] where \(\omega \in \mathbb{R}^n\) is a vector of parameters (weights).

Suppose we have a design matrix \(X^{(\mathrm{test})}\) of \(m\) example inputs that we will not use for training, only to evaluate how well the model performs. We call it the test set. One way of measuring the performance of the model is to compute the mean squared error of the model on the test set: \[\mathrm{MSE}_{\mathrm{test}} = \frac{1}{m} \sum_i \left(\hat{y}^{(\mathrm{test})} - y^{(\mathrm{test})}\right)_i^2\]

To minimize \(\mathrm{MSE}_{\mathrm{train}}\), we set the weight vector as \[\omega = \left(X^{(\mathrm{train})\top} X^{(\mathrm{train})}\right)^{-1} X^{(\mathrm{train})\top} y^{(\mathrm{train})}\]

The system of equations whose solution is given by this expression is known as the normal equations. It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter, an intercept term \(b\). The mapping from features to predictions is then an affine function.
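
As a sketch of the whole procedure, the following minimal NumPy example fits the affine model by solving the normal equations on synthetic data (all numbers here are illustrative, not from the text):

```python
# A minimal sketch of linear regression via the normal equations.
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))                 # design matrix, m = 100
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y_train = X_train @ true_w + true_b + 0.1 * rng.normal(size=100)

# Absorb the intercept b by appending a constant feature of 1s, so the
# affine model y = w^T x + b becomes linear in the augmented features.
Xb = np.hstack([X_train, np.ones((100, 1))])

# Solve the normal equations X^T X w = X^T y; lstsq does this in a
# numerically stable way.
w, *_ = np.linalg.lstsq(Xb, y_train, rcond=None)
print(w)                                            # ~ [1.5, -2.0, 0.5, 0.7]

# Evaluate with mean squared error on a held-out test set.
X_test = rng.normal(size=(20, 3))
y_test = X_test @ true_w + true_b
y_hat = np.hstack([X_test, np.ones((20, 1))]) @ w
print(np.mean((y_hat - y_test) ** 2))               # small test MSE
```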

Capacity, Overfitting and Underfitting


The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs -- not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.

What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well.

If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are selected, then we can make some progress.

The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent of each other, and that the training set and the test set are identically distributed, drawn from the same probability distribution as each other.

The probabilistic framework and the i.i.d. assumptions enable us to find the relationship between the test error and the training error mathematically.

When we use a machine learning algorithm, we do not fix the parameters ahead of time and then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected training error. The factors determining how well a machine learning algorithm will perform are its ability to:

  • Make the training error smaller.
  • Make the gap between training error and test error smaller.

These two factors correspond to two central challenges in machine learning: underfitting and overfitting.

  • Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
  • Overfitting occurs when the gap between test error and training error is too large.

We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity refers to its ability to fit a variety of functions. Models with low capacity will struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
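
A quick sketch of this effect, assuming synthetic data drawn from a noisy quadratic: a degree-1 polynomial underfits, degree 2 is about right, and degree 9 tends to overfit, driving training error down while the train/test gap grows.

```python
# A sketch of underfitting vs. overfitting by varying model capacity
# (polynomial degree) on synthetic data from a noisy quadratic.
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, 20)
y_train = x_train**2 + 0.1 * rng.normal(size=20)
x_test = rng.uniform(-1, 1, 100)
y_test = x_test**2 + 0.1 * rng.normal(size=100)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)   # high degree: low train error, larger gap
```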

The error incurred by an oracle making predictions from the true distribution is called Bayes error.

The No Free Lunch Theorem


Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.

Fortunately, these results hold only when we average over all possible data-generating distributions.

Regularization


The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions.

We can also give a learning algorithm a preference for one solution over another in its hypothesis space. For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize a sum comprising both the mean squared error on the training set and a criterion \(J(\omega)\) that expresses a preference for the weights to have a smaller squared \(L^2\) norm. Specifically:

\[J(\omega) = \mathrm{MSE}_{\mathrm{train}} + \lambda \omega^\top \omega \]

where \(\lambda\) is a value chosen ahead of time that controls the strength of our preference for smaller weights. When \(\lambda = 0\), we impose no preference, and larger \(\lambda\) forces the weights to become smaller.
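
The penalized objective is still quadratic in \(\omega\), so it has the closed-form solution \(\omega = (X^\top X + \lambda I)^{-1} X^\top y\) (ridge regression). A minimal sketch on synthetic data:

```python
# A minimal sketch of linear regression with weight decay (ridge
# regression): w = (X^T X + lambda * I)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

def ridge(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

for lam in (0.0, 0.1, 10.0):
    w = ridge(X, y, lam)
    print(lam, np.linalg.norm(w))   # larger lambda -> smaller weight norm
```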

More generally, we can regularize a model that learns a function \(f(x;\theta)\) by adding a penalty, called a regularizer, to the cost function.

Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.

Hyperparameters and Validation Sets


It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing the hyperparameters to be updated accordingly.
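
A minimal sketch of this split, reusing the hypothetical `ridge` helper from the weight-decay sketch to choose \(\lambda\) on a validation set (all data synthetic):

```python
# A minimal sketch of hyperparameter selection on a validation set
# split off from the training data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=100)

# Split the training data into two disjoint subsets.
X_fit, y_fit = X[:80], y[:80]     # used to learn the parameters
X_val, y_val = X[80:], y[80:]     # used to estimate generalization error

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Pick the lambda with the lowest validation error.
best = min(
    (1e-3, 1e-2, 1e-1, 1.0, 10.0),
    key=lambda lam: np.mean((X_val @ ridge(X_fit, y_fit, lam) - y_val) ** 2),
)
print("chosen lambda:", best)
```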

Cross-Validation


A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm A works better than algorithm B on the given task.
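
The usual remedy is k-fold cross-validation: train k times, each time holding out a different fold, and average the k error estimates. A minimal sketch for a linear model (the `fit` argument is any hypothetical training routine returning a weight vector):

```python
# A minimal k-fold cross-validation sketch.
import numpy as np

def k_fold_mse(X, y, fit, k=5, seed=0):
    """Average held-out MSE of `fit` (a function (X, y) -> w) over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit(X[train], y[train])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)   # error estimate averaged over the k folds
```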

Estimators, Bias and Variance


Point Estimation


Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. In general, the quantity of interest can be a single parameter or a vector of parameters in some parametric model.
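
For example, here is a sketch of the sample mean as a point estimator of a Gaussian mean, with its bias and variance estimated empirically over many resampled datasets (all numbers are illustrative):

```python
# A minimal sketch of point estimation: the sample mean as an estimator
# of a Gaussian mean, with empirical bias and variance over many trials.
import numpy as np

rng = np.random.default_rng(4)
true_mean, m, trials = 3.0, 25, 10_000

# One point estimate (the sample mean) per sampled dataset of m examples.
estimates = rng.normal(true_mean, 1.0, size=(trials, m)).mean(axis=1)

print("bias     ~", estimates.mean() - true_mean)   # ~ 0: unbiased
print("variance ~", estimates.var())                # ~ sigma^2 / m = 0.04
```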
