Chapter 1: General Introduction

Acknowledgment: Most of the material comes from Yuan Yang's course "Machine Learning".

Overview of Supervised Learning

Supervised learning is an important sub-area of machine learning.

Input: \(X = (x_1, x_2, \ldots, x_N)\)

Output: \(Y = (y_1, y_2, \ldots, y_N)\)

We want to learn a function \(f(x)\) such that \(f(x_i) \approx y_i\).

We need a way to check whether the function our model learned is any good.

We define a loss function \(l\) to measure the distance of prediction \(f(X)\) to \(Y\):

  • For a categorical target (e.g., \(y\) = cat, dog, good, or bad; classification):

    • \(l(f,x_i, y_i) = 1\) if \(f(x_i) \ne y_i\)
    • \(l(f,x_i, y_i) = 0\) if \(f(x_i) = y_i\)
    • This is good but not differentiable. We will get back to it later.
  • For a real-valued target (e.g., \(y\) = 0.1, 0.55, 1.5; regression):

    • \(l(f,x_i, y_i) = \mathrm{dist}(f(x_i), y_i)\) for some distance function, for example the squared (L2) distance: \(l(f,x_i,y_i) = (f(x_i) - y_i)^2\)

We can calculate the overall loss \(L(f, X, Y)\) as the average of the individual losses over all data points:

\[L(f,X,Y) = \frac{1}{N} \sum_il(f,x_i,y_i) \]
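As a concrete illustration, here is a minimal sketch (assuming NumPy; the toy linear predictor and data are made up) of the 0-1 loss, the squared loss, and the averaged overall loss \(L(f, X, Y)\):

```python
import numpy as np

def zero_one_loss(y_pred, y_true):
    # 1 if the prediction is wrong, 0 if it is correct (classification)
    return float(y_pred != y_true)

def squared_loss(y_pred, y_true):
    # (f(x_i) - y_i)^2 (regression)
    return float((y_pred - y_true) ** 2)

def overall_loss(f, X, Y, loss):
    # L(f, X, Y) = (1/N) * sum_i l(f, x_i, y_i)
    return float(np.mean([loss(f(x), y) for x, y in zip(X, Y)]))

# Toy usage with a hypothetical linear predictor
f = lambda x: 2.0 * x
X = np.array([0.0, 1.0, 2.0])
Y = np.array([0.1, 1.9, 4.2])
print(overall_loss(f, X, Y, squared_loss))
```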

Dataset

Why do we need to build a dataset the way it is done nowadays (with held-out splits)?

  • f is a complicated if-function (memorization):

    • If input = \(x_i\), output \(y_i\). Otherwise, output a random number.
  • We get \(L(f,X,Y) = 0\). But this function \(f\) is useless!

So minimizing the training loss is not enough; we also need generalization!
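As a toy illustration (the numbers below are made up), a lookup-table "memorizer" achieves zero training loss yet is useless on unseen inputs:

```python
import random

X_train = [1.0, 2.0, 3.0]
Y_train = [1.0, 4.0, 9.0]            # true relation: y = x^2
memo = dict(zip(X_train, Y_train))

def f(x):
    # If the input was seen during training, return its stored label;
    # otherwise output a random number.
    return memo.get(x, random.random())

train_loss = sum((f(x) - y) ** 2 for x, y in zip(X_train, Y_train)) / len(X_train)
print(train_loss)    # 0.0 on the training set
print(f(4.0))        # random, nowhere near the true value 16.0
```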

Validation set: the validation set is used to tune and optimize the model's hyperparameters (e.g., learning rate, regularization strength, model complexity). During training, you fit the model on the training set and then use the validation set to evaluate its performance on data it has not seen.

Test set: the test set is used to evaluate the final model after training and validation are complete. It is usually completely independent of the training and validation sets; the model never sees this data during training or validation.
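A minimal sketch of such a split (the 70/15/15 proportions and the random data are illustrative assumptions, not prescribed by the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 5))
Y = rng.normal(size=N)

idx = rng.permutation(N)
n_train, n_val = int(0.7 * N), int(0.15 * N)
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# Fit on the training split, tune hyperparameters on the validation split,
# and report the final performance once on the test split.
print(len(train_idx), len(val_idx), len(test_idx))
```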

Overfitting & underfitting

  • Underfit: Your function does not have enough representation power (less common these days)
  • Overfit: Your function has too much representation power
    • Overfitting easily gives you \(L_{train} = 0\).
    • However, there is a generalization problem: \(L_{test}\) could be very bad! (See the sketch below.)
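One way to see both effects is to fit polynomials of increasing degree to a few noisy points; this sketch (the data and degrees are illustrative assumptions) compares training and test error:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 10)
y_train = true_f(x_train) + 0.1 * rng.normal(size=10)
x_test = rng.uniform(0, 1, 100)
y_test = true_f(x_test) + 0.1 * rng.normal(size=100)

for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)
    mse = lambda x, y: float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    # degree 1 underfits (both errors high); degree 9 can drive the training
    # error near zero while the test error gets much worse.
    print(degree, mse(x_train, y_train), mse(x_test, y_test))
```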

How to address overfitting

To avoid overfitting, we want to make sure \(f\) is “simple”.

  • The classical view holds that we should restrict the representation power of \(f\) so that it cannot overfit! This is called “regularization” (see the sketch after this list).

  • Modern view

    • Sometimes, explicit regularization is not necessary.

    • For neural networks, there are implicit regularizations to prevent overfitting.

      • Implicit regularization in neural networks refers to regularization effects that arise naturally during training due to certain factors. These factors include the model's structure (e.g., dropout layers, weight sharing), the optimization algorithm (e.g., SGD, Adam), and data augmentation. They help the network generalize better to unseen data and thus reduce the risk of overfitting.
    • This is because of the optimization process, specifically the SGD algorithm. We will cover it later.

      • SGD is an optimization algorithm for adjusting neural network weights; at each iteration it randomly selects a small batch of training samples for the gradient update. This randomness helps the network avoid getting stuck in poor local minima and also has a regularizing effect, reducing the risk of overfitting.
    • In other words, although overfitting could lead to a bad \(L(f, X, Y)\) on held-out data, in practice it rarely happens.

    • Researchers were afraid of overfitting for neural networks for decades!
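For the classical view, a standard example of explicit regularization is ridge regression: add an L2 penalty so that the learned weights stay small. A minimal sketch (the data and the penalty strength lam are made-up assumptions):

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    # Minimize ||Xw - Y||^2 + lam * ||w||^2; closed-form solution.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

w_plain = ridge_fit(X, Y, lam=0.0)   # no regularization
w_ridge = ridge_fit(X, Y, lam=5.0)   # larger lam => smaller weights, "simpler" f
print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```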

Overview of Unsupervised Learning

No labels \(Y\)! What can you do? Learn the distribution of \(X\).

Clustering

But there is no unique solution! The result depends on the loss function.
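For example, k-means uses one possible objective (minimize squared distance to the nearest centroid); a different objective or initialization can give a different clustering. A minimal sketch, assuming NumPy and toy two-blob data:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)   # roughly (0, 0) and (5, 5)
```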

Why is it useful?

  • Data mining

    Unsupervised learning algorithms can group similar data points together in a process called clustering. This is useful in data mining because it helps uncover hidden patterns or structure in a dataset. For example, in customer segmentation, you can cluster customers by their behavior or preferences to identify distinct market segments.

  • Speeding up optimization
    Dimensionality reduction can be used to reduce the complexity of a dataset. This is especially useful with high-dimensional data, where optimization algorithms may converge slowly or get stuck in local minima. By reducing the dimensionality of the data, the optimization process becomes faster and easier to manage.

  • Recommendation systems

    • Recommendation systems use unsupervised learning to provide users with personalized recommendations. One common approach is collaborative filtering, which clusters users or items by their preferences or similarities. This clustering can be done with unsupervised techniques, allowing the system to recommend items to a user based on the preferences of similar users.
    • Another approach is matrix factorization, which can be viewed as a dimensionality-reduction technique applied to the user-item interaction matrix. By factorizing this matrix into a low-dimensional representation, the recommendation system can predict user preferences more efficiently (a small SVD sketch follows after this list).
  • In summary, unsupervised learning is a general framework for extracting useful information from data without requiring labeled examples. It is useful in data mining for discovering patterns, in optimization for speeding up the process, and in recommendation systems for providing personalized suggestions. These are only a few of the many applications of unsupervised learning in machine learning and data analysis.
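For the matrix factorization idea mentioned above, here is a small sketch (the toy rating matrix and the rank k are made-up assumptions) using a truncated SVD of a user-item matrix:

```python
import numpy as np

# toy users x items rating matrix (0 = unobserved, treated naively here)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                         # low-dimensional representation
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # rank-k approximation
print(np.round(R_hat, 2))                     # predicted preference scores
```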

Principal component analysis (PCA)

Find the most important components (directions)

A best-fitting line is defined as one that minimizes the average squared distance from the points to the line. Equivalently, it is the line along which the projected data have the largest variance.
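A minimal PCA sketch, assuming NumPy: center the data, take the SVD, and read off the directions of largest variance (the toy correlated data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                        # rows are principal directions
explained_var = s ** 2 / (len(X) - 1)  # variance captured along each direction
print(components[0])                   # direction of the best-fitting line
print(explained_var)
```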

Generative model

The generative model encodes the data distribution. We can sample from it to obtain abundant data. But the data distribution is hard to describe directly.

The generative model does not directly answer the question of “describing a bizarre distribution”. Instead, it maps a simple distribution, such as a Gaussian, onto the target distribution.
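A toy illustration of this idea (the map g below is hand-picked, not a trained model): sample \(z\) from a Gaussian and push it through a nonlinear map to obtain samples from a non-Gaussian distribution.

```python
import numpy as np

def g(z):
    # a fixed nonlinear map; a real generative model would learn this mapping
    return np.tanh(z) * 3.0 + 0.5 * z ** 2

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)   # samples from a standard Gaussian
x = g(z)                      # samples from a (non-Gaussian) target-like distribution
print(x.mean(), x.std())
```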

Anomaly detection

Dimension reduction

Semi-supervised Learning

In practice, semi-supervised learning is common:

  • When some data points (say 10%) have labels, and the rest (say 90%) do not.

Can we always use the unlabeled data to improve prediction?

  • Not always! Why not? (For example, if \(Y\) is pure noise, unlabeled data cannot help.)

(Implicitly), we need some assumptions for semi-supervised learning:

  • Continuity assumption: Points close to each other are more likely to share a label. This gives geometrically simple decision boundaries (a toy sketch appears after this list).
  • Manifold assumption: The data lie approximately on a manifold of much lower dimension than the input space.
    • In short, a manifold is a mathematical concept expressing that the distribution of data in a high-dimensional space has some low-dimensional structure. For example, if the data points in a high-dimensional space lie almost entirely on a curved low-dimensional curve or surface, this assumption says that this curve or surface is the important structure of the data.
    • (a) The input space is composed of multiple low-dimensional manifolds, and all data points lie on these manifolds; (b) data points lying on the same manifold share the same label.
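As a toy sketch of the continuity assumption (the two-blob data and the 1-nearest-labeled-neighbor rule are illustrative assumptions, not the course's method), labels from a few labeled points can be propagated to nearby unlabeled points:

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated blobs of unlabeled points, one labeled point per blob
X_unlabeled = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
X_labeled = np.array([[0.0, 0.0], [4.0, 4.0]])
y_labeled = np.array([0, 1])

# each unlabeled point takes the label of its closest labeled point
dists = ((X_unlabeled[:, None, :] - X_labeled[None]) ** 2).sum(axis=-1)
y_pred = y_labeled[np.argmin(dists, axis=1)]
print(np.bincount(y_pred))    # roughly 50 points per class
```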