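Getting Started with Machine Learning on the MATLAB Website (Notes on Some Key Course Material)
Machine Learning
The introductory machine learning course on the MATLAB website is really aimed at people who already have some background and want a quick pass over the basic commands. Beginners with time to spare can still follow along, and any newcomer should be able to understand it. The course focuses on implementing things in code and says comparatively little about the theory behind the functions. If you want to try machine learning, give it a go: it is free, it runs in an online environment, and it walks you through a fairly interesting example of recognizing handwritten digits or letters.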
Handwritten letters were stored as individual text files. Each file is comma-delimited and contains four columns: a timestamp, the horizontal location of the pen, the vertical location of the pen, and the pressure of the pen. The timestamp is the number of milliseconds elapsed since the beginning of the data collection. The other variables are in normalized units (0 to 1). For the pen locations, 0 represents the bottom and left edge of the writing surface and 1 represents the top and right edge.
The pen positions for the handwriting data are measured in normalized units (0 to 1). However, the tablet used to record the data is not square. This means a vertical distance of 1 corresponds to 10 inches, while the same horizontal distance corresponds to 15 inches. To correct this, the horizontal units should be adjusted to the range [0 1.5] instead of [0 1].
The time values have no physical meaning. They represent the number of milliseconds elapsed from the start of the data collection session. This makes it difficult to interpret plots of pen position through time. A more useful time variable would be duration (measured in seconds) since the beginning of each letter.
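The two corrections described above (rescaling the horizontal axis and re-basing the time) can be sketched in a few lines. This is a plain-Python illustration rather than the course's MATLAB code; the function name `preprocess`, the tuple layout, and the sample values are assumptions made for the example.

```python
def preprocess(samples, aspect=1.5):
    """Rescale x and re-base time for one letter.

    samples: list of (time_ms, x, y, pressure) tuples in recorded order
             (hypothetical layout mirroring the four file columns).
    aspect:  width/height ratio of the tablet (15 in / 10 in = 1.5).
    """
    t0 = samples[0][0]  # timestamp of the first point of this letter
    cleaned = []
    for time_ms, x, y, pressure in samples:
        duration_s = (time_ms - t0) / 1000.0  # ms since session start -> s since letter start
        cleaned.append((duration_s, x * aspect, y, pressure))
    return cleaned


# Made-up sample: three pen readings taken 5 seconds into a session.
raw = [(5000, 0.2, 0.5, 0.1), (5010, 0.4, 0.6, 0.3), (5025, 0.6, 0.4, 0.2)]
letter = preprocess(raw)
```

After this step the first time value is 0 and the horizontal positions lie in [0, 1.5], so distances in x and y are measured in the same physical units.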
What aspects of these letters could be used to distinguish a J from an M or a V? Instead of using the raw signals, the goal is to compute values that distill the entire signal into simple, useful units of information known as features.
- For the letters J and M, a simple feature might be the aspect ratio (the height of the letter relative to the width). A J is likely to be tall and narrow, whereas an M is likely to be more square.
- Compared to J and M, a V is quick to write, so the duration of the signal might also be a distinguishing feature.
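As a rough illustration of the two features just listed, the sketch below computes an aspect ratio and a duration from preprocessed signals. It is written in plain Python, and the helper names and sample values are hypothetical.

```python
def aspect_ratio(xs, ys):
    """Height of the letter relative to its width."""
    return (max(ys) - min(ys)) / (max(xs) - min(xs))


def duration(ts):
    """Time taken to write the letter (last timestamp minus first)."""
    return ts[-1] - ts[0]


# Made-up tall, narrow stroke (J-like): 2.0 units high, 0.5 units wide.
xs = [0.0, 0.25, 0.5]
ys = [0.0, 1.0, 2.0]
ts = [0.0, 0.4, 0.9]

features = (aspect_ratio(xs, ys), duration(ts))
```

Each letter is thereby reduced from a full signal to a point in a two-dimensional feature space, which is what the classification model partitions.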
A classification model is a partitioning of the space of predictor variables into regions. Each region is assigned one of the output classes. In this simple example with two predictor variables, you can visualize these regions in the plane.
There is no single absolute “correct” way to partition the plane into the classes J, M, and V. Different classification algorithms result in different partitions.
Having built a model from the data, you can use it to classify new observations. This just requires calculating the features of the new observations and determining which region of the predictor space they are in.
By default, fitcknn fits a kNN model with k = 1. That is, the model uses just the single closest known example to classify a given observation. This makes the model sensitive to any outliers in the training data, such as those highlighted in the image above. New observations near the outliers are likely to be misclassified.
You can make the model less sensitive to the specific observations in the training data by increasing the value of k (that is, use the most common class of several neighbors). Often this will improve the model's performance in general. However, how a model performs on any particular test set depends on the specific observations in that set.
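The effect of k can be seen in a minimal nearest-neighbor sketch. This is plain Python written for illustration, not what fitcknn does internally; the function `knn_predict` and the toy data (aspect ratio, duration) are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Return the most common label among the k training points closest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set: a cluster of J samples, two V samples,
# and one V-labeled outlier sitting inside the J cluster.
points = [(3.0, 0.5), (3.1, 0.6), (2.9, 0.4), (1.0, 0.2), (1.1, 0.25), (3.05, 0.55)]
labels = ["J", "J", "J", "V", "V", "V"]

query = (3.04, 0.55)  # new observation right next to the outlier
with_k1 = knn_predict(points, labels, query, k=1)
with_k3 = knn_predict(points, labels, query, k=3)
```

With k = 1 the query inherits the outlier's label; with k = 3 the two nearby J samples outvote it.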
Typically you will want to apply a series of preprocessing operations to each sample of your raw data. The first step to automating this procedure is to make a custom function that applies your specific preprocessing operations.
Currently, you still need to call your function manually. To automate your data importing and preprocessing, you want your datastore to apply this function whenever the data is read. You can do this with a transformed datastore. The transform function takes a datastore and a function as inputs. It returns a new datastore as output. This transformed datastore applies the given function whenever it imports data.
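The idea of "apply the function whenever data is read" can be mimicked with a lazy generator. The sketch below is only a Python analogy of that behavior, not the MATLAB transform API; `transformed`, `to_seconds`, and the in-memory `raw_store` are hypothetical stand-ins.

```python
def transformed(source, func):
    """Return a lazy reader that applies func to each item as it is read."""
    for item in source:
        yield func(item)


calls = []  # records which samples have actually been processed


def to_seconds(sample):
    """Example preprocessing step: convert (time_ms, x) pairs to (time_s, x)."""
    calls.append(sample)
    return [(t / 1000.0, x) for t, x in sample]


# Stand-in for a datastore: each entry plays the role of one file's contents.
raw_store = [[(0, 0.1), (500, 0.2)], [(0, 0.3), (250, 0.4)]]

store = transformed(raw_store, to_seconds)
first = next(store)  # only the first "file" is read and preprocessed here
```

Nothing is processed when `store` is created; the function runs once per item, at read time.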
Statistical Functions
Measures of Central Tendency
Function | Description |
---|---|
mean | Arithmetic mean |
median | Median (middle) value |
mode | Most frequent value |
trimmean | Trimmed mean (mean, excluding outliers) |
geomean | Geometric mean |
harmean | Harmonic mean |
Measures of Spread
Function | Description |
---|---|
range | Range of values (largest – smallest) |
std | Standard deviation |
var | Variance |
mad | Mean absolute deviation |
iqr | Interquartile range (75th percentile minus 25th percentile) |
Measures of Shape
Function | Description |
---|---|
skewness | Skewness (third central moment) |
kurtosis | Kurtosis (fourth central moment) |
moment | Central moment of arbitrary order |
The handwriting samples have all been shifted so they have zero mean in both horizontal and vertical position. What other statistics could provide information about the shape of the letters? Different letters will have different distributions of points. Statistical measures that describe the shape of these distributions could be useful features.
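To make the shape measures concrete, here is a small Python sketch of skewness and kurtosis built from central moments. It assumes the common convention of dividing by n and normalizing by a power of the second moment; the function names and sample values are illustrative only.

```python
import statistics


def central_moment(values, order):
    """Central moment of the given order (divides by n, not n - 1)."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by the 1.5th power of the variance."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by the squared variance."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]  # evenly spread points
lopsided = [0.0, 0.0, 0.0, 1.0, 5.0]     # long tail to the right

shape_features = (skewness(symmetric), skewness(lopsided), kurtosis(symmetric))
```

A symmetric distribution of points has skewness near zero, while a long tail on one side pushes it positive or negative, which is the kind of difference that can separate letters with the same mean position.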
An important aspect of detecting letters written on a tablet is that there is useful information in the rhythm and flow of how the letters are written. To describe the shape of the signals through time, it can be useful to know the velocity of the pen, or, equivalently, the slope of the graph of position through time.
The raw data recorded from the tablet has only position (not velocity) through time, so velocity must be calculated from the raw data. With discrete data points, this means estimating the velocity by using a finite difference approximation:
v = Δx / Δt
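The finite difference approximation can be sketched as a difference of consecutive positions divided by the difference of consecutive times. This is a plain-Python illustration; the function name and sample values are assumptions, and the time stamps are assumed to be strictly increasing so that Δt is never zero.

```python
def velocity(positions, times):
    """Estimate velocity between consecutive samples as delta-x over delta-t.

    Returns one value per interval, so the result is one element
    shorter than the inputs. Assumes strictly increasing times.
    """
    return [(x1 - x0) / (t1 - t0)
            for x0, x1, t0, t1 in zip(positions, positions[1:], times, times[1:])]


x = [0.0, 0.5, 1.5]   # pen position
t = [0.0, 0.5, 1.0]   # seconds since the start of the letter
v = velocity(x, t)
```

Here the pen covers 0.5 units in the first half second and 1.0 unit in the second, so the estimated velocity doubles between the two intervals.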
The pair of signals on the left have a significantly different shape to the pair of signals on the right. However, the relationship between the two signals in each pair is similar in both cases: in the blue regions, the upper signal is increasing while the lower signal is decreasing, and vice versa in the yellow regions. Correlation attempts to measure this similarity, regardless of the shape of the signal.
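A short Python sketch of the Pearson correlation coefficient shows how two pairs of signals with different shapes can share the same relationship. The function and the sample signals are made up for illustration.

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length signals."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(spread_a * spread_b)


# Pair 1: straight-line signals moving in opposite directions.
r_linear = correlation([0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 1.0, 0.0])

# Pair 2: curved signals, but one still falls exactly as the other rises.
r_curved = correlation([0.0, 1.0, 4.0, 9.0], [9.0, 8.0, 5.0, 0.0])
```

Both pairs give a correlation of about -1 even though the individual signals look different, because in each pair one signal decreases whenever the other increases.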
To automate your feature extraction, you want your datastore to apply your extraction function whenever the data is read. As with preprocessing, you can do this with a transformed datastore.
The MAT-file letterdata.mat contains the table traindata, which represents feature data for 2906 samples of individual letters. There are 25 features, including statistical measures, correlations, and maxima/minima for the position, velocity, and pressure of the pen.
Use the command classificationLearner to open the Classification Learner app. Select traindata as the data to use. The app should correctly detect Character as the response variable to predict.
Choose the default validation option.
Select a model and click the Train button.
Try a few of the standard models with default options. See if you can achieve at least 80% accuracy.
Note that SVMs work on binary classification problems (i.e. where there are only two classes). To make SVMs work on this problem, the app is fitting many SVMs. These models will therefore be slow to train.
Similarly, ensemble methods work by fitting multiple models. These will also be slow to train.
For any response class X, you can divide a machine learning model's predictions into four groups:
- True positives (green) – predicted to be X and was actually X
- True negatives (blue) – predicted to be not X and was actually not X
- False positives (yellow) – predicted to be X but was actually not X
- False negatives (orange) – predicted to be not X but was actually X
With multiple classes, the false negatives and false positives will be given by a whole row or column of the confusion matrix (except for the diagonal element that represents the true positive).
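The four groups can be counted directly from paired true and predicted labels. The sketch below is plain Python with made-up labels; `confusion_counts` and `false_negative_rate` are hypothetical helper names.

```python
def confusion_counts(true_labels, predicted_labels, target):
    """Count TP, TN, FP, FN for one class, treating every other class as 'not target'."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(true_labels, predicted_labels):
        if predicted == target:
            counts["TP" if actual == target else "FP"] += 1
        else:
            counts["FN" if actual == target else "TN"] += 1
    return counts


def false_negative_rate(counts):
    """Fraction of actual target samples that the model labeled as something else."""
    return counts["FN"] / (counts["TP"] + counts["FN"])


# Made-up predictions for six letters.
actual = ["U", "U", "U", "V", "M", "N"]
predicted = ["U", "V", "M", "V", "M", "U"]

u_counts = confusion_counts(actual, predicted, "U")
u_fnr = false_negative_rate(u_counts)
```

In this toy case two of the three true U samples were predicted as other letters (false negatives), and one N was predicted as U (a false positive).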
False Negatives
With 26 letters, you will need to enlarge the confusion chart to make the values visible. If you open the plot in a separate figure, you can resize it as large as you like.
The row summary shows the false negative rate for each class. This shows which letters the kNN model has the most difficulty identifying (i.e., the letters the model most often thinks are something else).
This model has particular difficulty with the letter U, most often mistaking it for M, N, or V.
Some confusions seem reasonable, such as U/V or H/N. Others are more surprising, such as U/K. Having identified misclassifications of interest, you will probably want to look at some of the specific data samples to understand what is causing the misclassification.
Improving the Model
Even if your model works well, you will typically want to look for improvements before deploying it for use. Theoretically, you could try to improve your results at any part of the workflow. However, collecting data is typically the most difficult step of the process, which means you often have to work with the data you have.
If you have the option of collecting more data, you can use the insights you have gained so far to inform what new data you need to collect.
In the handwriting example, volunteers were instructed only to write lower-case letters “naturally”. Investigating the data set reveals that there are often discrete groups within a particular letter, such as a block-letter style and a cursive style. This means that two quite different sets of features can represent the same letter.
One way to improve the model would be to treat these variants as separate classes. However, this would mean having many more than 26 classes. To train such a model, you would need to collect more samples, instruct the volunteers to write both block and cursive style, and label the collected data accordingly.
Low accuracy in both your training and testing sets is an indication that your features do not provide enough information to distinguish the different classes. In particular, you might want to look at the data for classes that are frequently confused, to see if there are characteristics that you can capture as new features.
However, too many features can also be a problem. Redundant or irrelevant features often lead to low accuracy and increase the chance of overfitting – when your model is learning the details of the training data rather than the broad patterns. A common sign of overfitting is that your model performs well on the training set but not on new data. You can use a feature selection technique to find and remove features that do not significantly add to the performance of your model.
You can also use feature transformation to perform a change of coordinates on your features. With a technique such as Principal Component Analysis (PCA), the transformed features are chosen to minimize redundancy and ordered by how much information they contain.
The Classification Learner app provides an easy way to experiment with different models. You can also try different options. For example, for kNN models, you can vary the number of neighbors, the weighting of the neighbors based on distance, and the way that distance is defined.
Some classification methods are highly sensitive to the training data, which means you might get very different predictions from different models trained on different subsets of the data. This can be harnessed as a strength by making an ensemble – training a large number of these so-called weak learners on different permutations of the training data and using the distribution of individual predictions to make the final prediction.
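One way to picture an ensemble is bagging: train many simple models on random resamples of the training data and take a majority vote. The Python sketch below assumes bootstrap resampling (sampling with replacement) and a 1-nearest-neighbor weak learner; both are choices made for this illustration, not necessarily what any particular toolbox method uses.

```python
import math
import random
from collections import Counter


def nearest_label(points, labels, query):
    """Weak learner: label of the single closest training point."""
    best = min(range(len(points)), key=lambda i: math.dist(points[i], query))
    return labels[best]


def bagged_predict(points, labels, query, n_models=25, seed=0):
    """Majority vote over weak learners, each fit on a bootstrap resample."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Draw a resample of the same size, with replacement.
        idx = [rng.randrange(len(points)) for _ in points]
        sample_points = [points[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        votes[nearest_label(sample_points, sample_labels, query)] += 1
    return votes.most_common(1)[0][0]


# Toy one-feature data: class A near 0, class B near 1.
points = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
labels = ["A", "A", "A", "B", "B", "B"]

prediction = bagged_predict(points, labels, (0.05,))
```

Each individual learner sees a slightly different training set, so its prediction can vary; the vote across all of them is what the ensemble reports.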
For the handwriting example, some pairs of letters (such as N and V) have many similar features and are distinguished by only one or two key features. This means that a distance-based method such as kNN may have difficulty with these pairs. An alternative is an ensemble approach known as Error-Correcting Output Coding (ECOC), which uses multiple models to distinguish between different binary pairs of classes. Hence, one model can distinguish between N and V, while another can distinguish between N and E, and another between E and V, and so on.
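The pairwise scheme described above can be sketched by enumerating every pair of classes and letting the binary models vote. This Python sketch assumes a one-versus-one design, which is only one possible ECOC coding; the binary models themselves are not implemented, and the pairwise winners in the example are made up.

```python
import string
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """Every unordered pair of classes; one binary model would be trained per pair."""
    return list(combinations(classes, 2))


def vote(pair_winners):
    """Final prediction: the class that wins the most pairwise contests."""
    return Counter(pair_winners).most_common(1)[0][0]


# With 26 letters there are 26 * 25 / 2 pairs, hence that many binary models.
letter_pairs = one_vs_one_pairs(string.ascii_lowercase)
n_models = len(letter_pairs)

# Three-class illustration with hypothetical pairwise winners:
# E-vs-N -> N, E-vs-V -> E, N-vs-V -> N.
small_pairs = one_vs_one_pairs(["E", "N", "V"])
decision = vote(["N", "E", "N"])
```

The pair count grows quadratically with the number of classes, which is consistent with multiclass SVM training being slow on a 26-class problem.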
When trying to evaluate different models, it is important to have an accurate measure of a model's performance. The simplest, and computationally cheapest, way to do validation is holdout – randomly divide your data into a training set and a testing set. This works for large data sets. However, for many problems, holdout validation can result in the test accuracy being dependent on the specific choice of test data.
You can use k-fold cross-validation to get a more accurate estimate of performance. In this approach, multiple models are trained and tested, each on a different division of the data. The reported accuracy is the average from the different models.
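The k-fold procedure can be sketched as follows: shuffle the sample indices, split them into k folds, and let each fold serve once as the test set. This is a plain-Python illustration; `k_fold_indices`, `cross_validate`, and the `score_fold` callback are hypothetical names, and the scoring function is left to the caller.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering every sample once as test data."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(n_samples, k, score_fold, seed=0):
    """Average the score returned by score_fold(train, test) over all k folds."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


splits = k_fold_indices(10, 5)

# Dummy scorer for demonstration: reports the test fraction of each split.
average = cross_validate(10, 5, lambda train, test: len(test) / (len(train) + len(test)))
```

Because every sample is used for testing exactly once, the averaged score depends less on one particular choice of test data than a single holdout split does.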
Accuracy is only one simple measure of the model's performance. It is also important to consider the confusion matrix, false negative rates, and false positive rates. Furthermore, the practical impact of a false negative may be significantly different to that of a false positive. For example, a false positive medical diagnosis may cause stress and expense, but a false negative could be fatal. In these cases, you can incorporate a cost matrix into the calculation of a model's loss.
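A cost matrix can be illustrated with a small Python sketch in which each (actual, predicted) combination carries a penalty. The cost values below are invented purely for the example, as are the labels and the function name.

```python
def average_cost(true_labels, predicted_labels, cost):
    """Mean misclassification cost, looking up each (actual, predicted) pair in cost."""
    total = sum(cost[(actual, predicted)]
                for actual, predicted in zip(true_labels, predicted_labels))
    return total / len(true_labels)


# Hypothetical costs: a missed diagnosis is ten times worse than a false alarm.
cost = {
    ("sick", "sick"): 0,
    ("healthy", "healthy"): 0,
    ("healthy", "sick"): 1,    # false positive
    ("sick", "healthy"): 10,   # false negative
}

actual = ["sick", "sick", "healthy", "healthy"]
model_a = ["sick", "healthy", "healthy", "healthy"]  # one false negative
model_b = ["sick", "sick", "sick", "healthy"]        # one false positive

loss_a = average_cost(actual, model_a, cost)
loss_b = average_cost(actual, model_b, cost)
```

Both models get three of four predictions right, so their accuracy is identical, yet the model that produces the false negative has a much higher loss under this cost matrix.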