What is fit and overfit
- In statistics, goodness of fit refers to how closely a model's predicted values match the observed (true) values.
- Overfit: a model that has learned the noise instead of the signal is considered "overfit" because it fits the training dataset well but fits new data poorly.
- Underfitting occurs when a model is too simple (informed by too few features or regularized too much), which makes it too inflexible to learn the patterns in the dataset.
- Simple learners tend to have less variance in their predictions but more bias towards wrong outcomes. Complex learners, on the other hand, tend to have less bias but more variance in their predictions. Both bias and variance are forms of prediction error in machine learning. Typically, reducing the error from bias increases the error from variance, and vice versa.
This trade-off between too simple (high bias) and too complex (high variance) is a key concept in statistics and machine learning, and one that affects all supervised learning algorithms.
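A minimal sketch of the trade-off, in Python with only the standard library. The k-nearest-neighbours regressor, the sine-plus-noise data, and the names make_data, knn_predict, and mse are all assumptions chosen for illustration. With k=1 the learner memorizes the training set (complex, high variance); with k equal to the training-set size it always predicts the mean target (simple, high bias).

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Sample n points from y = sin(x) plus Gaussian noise (a made-up toy signal)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN learner on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = make_data(60, rng)
test_x, test_y = make_data(60, rng)

results = {}
for k in (1, 7, 60):  # complex learner, middle ground, simple learner
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(train_x, train_y, test_x, test_y, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k=1 the training MSE is exactly zero, because each training point is its own nearest neighbour, yet the test MSE is not zero. That gap is the signature of overfitting. With k=60 the training and test errors are both large, which is what underfitting looks like.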
How to detect overfitting
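5.1 We can split our initial dataset into separate training and test subsets. The model is fit on the training subset only; if it performs much better on the training subset than on the test subset, it is probably overfitting.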
5.2 Another tip is to start with a very simple model to serve as a benchmark. Then, as you try more complex algorithms, you'll have a reference point to see if the additional complexity is worth it.
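Both tips can be sketched together in plain Python. Everything here is an assumption for illustration: the synthetic data (y = 2x + 1 plus noise), the 25% test fraction, and the helper names train_test_split, fit_mean, fit_line, and mse. The benchmark simply predicts the mean training target; the more complex candidate is a one-feature least-squares line.

```python
import random

def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and hold out the first test_fraction as the test set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def fit_mean(train):
    """Benchmark model: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Ordinary least squares for a single feature, in closed form."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mse(model, rows):
    """Mean squared error of model over (x, y) rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

rng = random.Random(42)  # fixed seed so the run is repeatable
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))  # hypothetical signal + noise

train, test = train_test_split(data, 0.25, rng)
baseline = fit_mean(train)
line = fit_line(train)

for name, model in (("mean baseline", baseline), ("linear fit", line)):
    print(f"{name:13s} train MSE={mse(model, train):.2f}  test MSE={mse(model, test):.2f}")
```

Read the output two ways. Comparing train MSE with test MSE for one model shows whether it overfits; comparing the test MSE of the linear fit with that of the mean baseline shows whether the extra complexity pays for itself.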