Chapter 4: Supervised Learning
Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".
Linear Regression
This is a very traditional statistical task: given samples $(x_i, y_i)$, fit a linear predictor $w^\top x$.
Using least squares, the closed-form solution is $w = (X^\top X)^{-1} X^\top y$.
We can also use gradient descent. Define the square loss: $L(w) = \frac{1}{2}\sum_i (w^\top x_i - y_i)^2$.
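As a minimal sketch of this setup (the names `X`, `y`, `lr` are illustrative, not from the notes), gradient descent on the square loss looks like this:

```python
import numpy as np

def gd_linear_regression(X, y, lr=0.1, steps=1000):
    """Minimize the square loss L(w) = 0.5 * ||Xw - y||^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # averaged gradient of the square loss
        w -= lr * grad
    return w

# Toy usage: recover a known weight vector from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
print(gd_linear_regression(X, X @ w_true))   # should be close to w_true
```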
What if the task is not regression but classification?
Consider binary classification first. A naive way is to use $+1$ and $-1$ to represent the two classes and predict with $\mathrm{sign}(w^\top x)$. However, the sign function cannot provide meaningful gradient information.
Approach 1: if there is no gradient, design an algorithm that needs no gradient information: the perceptron.
Approach 2: turn the hard function into a soft one and create a gradient artificially: logistic regression.
Perceptron
Intuition: when a sample is misclassified, add it to (or subtract it from) the weight vector, so that the separator moves toward classifying that sample correctly.
Limitation: it can only learn linear functions, so it does not converge if the data is not linearly separable.
Update rule: adjust $w$ only on mistakes.
If $y_i = +1$ but $w^\top x_i \le 0$, set $w \leftarrow w + x_i$; if $y_i = -1$ but $w^\top x_i \ge 0$, set $w \leftarrow w - x_i$.
Combine these two into one rule: if $y_i\, w^\top x_i \le 0$, set $w \leftarrow w + y_i x_i$.
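A small sketch of the combined update rule (labels assumed to be $\pm 1$; `max_epochs` is a hypothetical cap for the non-separable case):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: on every mistake (y_i * w.x_i <= 0), update w <- w + y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                w = w + yi * xi
                mistakes += 1
        if mistakes == 0:                # a full pass with no mistakes: converged
            break
    return w
```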
Convergence Proof:
Assume the data is linearly separable with margin $\gamma > 0$: there is a unit vector $w^*$ such that $y_i\,(w^*)^\top x_i \ge \gamma$ for every $i$.
And assume $\|x_i\|_2 \le R$ for every $i$.
Start from $w_0 = 0$.
Then each update increases the projection onto $w^*$: $w_{t+1}^\top w^* = (w_t + y_i x_i)^\top w^* \ge w_t^\top w^* + \gamma$.
On the other hand: $\|w_{t+1}\|^2 = \|w_t\|^2 + 2 y_i\, w_t^\top x_i + \|x_i\|^2 \le \|w_t\|^2 + R^2$, since updates only happen when $y_i\, w_t^\top x_i \le 0$.
Telescoping: after $T$ mistakes, $w_T^\top w^* \ge T\gamma$ and $\|w_T\|^2 \le T R^2$.
So $T\gamma \le w_T^\top w^* \le \|w_T\|_2 \le \sqrt{T}\,R$, which gives $T \le R^2/\gamma^2$: the perceptron makes at most $R^2/\gamma^2$ mistakes and then converges.
Logistic Regression
Turn the classification problem into a regression problem over probabilities.
Instead of using a sign function, we can output a probability. Here comes the important idea we already used in matrix completion: relaxation!
Make the hard function $\mathrm{sign}(w^\top x)$ soft by replacing it with the sigmoid: $\hat{p} = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$.
It remains to define a loss function. The L1 and L2 losses are both not good enough, so we use cross entropy: $L(w) = -\sum_i \left[ y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i) \right]$ (with labels $y_i \in \{0,1\}$).
Explanation:
- We already know the actual probability distribution over the two classes: it puts mass $y$ on the positive class and $1-y$ on the negative class.
- We estimate the difference between this true distribution and our predicted distribution $(\hat{p}, 1-\hat{p})$; cross entropy is exactly such a measure, and minimizing it pushes $\hat{p}$ toward $y$. A small code sketch of the resulting gradient step follows below.
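A minimal sketch of logistic regression trained with gradient descent on the cross entropy (labels assumed to be 0/1; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.1, steps=2000):
    """Gradient descent on the cross-entropy loss; labels y are assumed to be 0/1."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)                 # predicted probabilities
        grad = X.T @ (p - y) / len(y)      # gradient of the cross entropy w.r.t. w
        w -= lr * grad
    return w
```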
feature learning
Linear regression/classification can learn everything!
- If the features are correct.
In general, linear regression is good enough, but feature learning is hard.
Deep learning is also called “representation learning” because it learns features automatically.
- The last step of deep learning is always linear regression/classification!
Regularization
This is a trick to avoid overfitting.
Ridge regression
This is linear regression with an L2 penalty: $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$.
An intuitive illustration of how this works. For each gradient descent step, split it into two parts:
- Part 1: Same as linear regression
- Part 2: "shrink" every coordinate by a factor of $(1 - 2\eta\lambda)$, where $\eta$ is the learning rate.
(This shrinking is weight decay, a very important trick today.)
The two parts eventually "cancel" out and the weights reach an equilibrium (see the step sketched below).
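A sketch of one such split step (the parameter names are assumptions; applying the two parts one after the other matches the exact gradient step up to $O(\eta^2)$):

```python
import numpy as np

def ridge_gd_step(w, X, y, lr=0.01, lam=0.1):
    """One gradient step for sum_i (w.x_i - y_i)^2 + lam * ||w||^2, split into two parts."""
    grad = 2 * X.T @ (X @ w - y)    # Part 1: same gradient as plain linear regression
    w = w - lr * grad
    w = (1 - 2 * lr * lam) * w      # Part 2: shrink every coordinate (weight decay)
    return w
```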
However, ridge regression cannot find important features. It is essentially linear regression + weight decay.
Although the weights are shrunk, they rarely become exactly zero, so every feature keeps a nonzero coefficient.
If we need to find important features, we need to optimize the L0-penalized objective: $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_0$.
Then relax the "hard" L0 norm to the "soft" L1 norm, as we always do. This gives LASSO regression: $\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \|w\|_1$.
LASSO regression
An intuitive illustration of how this works. For each gradient descent step, split it into two parts:
- Part 1: Same as linear regression
- Part 2: for every coordinate $w_j$: move it toward $0$ by a fixed amount $\eta\lambda$, clipping at $0$ if it would cross zero (soft thresholding).
The two parts eventually "cancel" out and reach an equilibrium; coordinates whose gradient signal is weaker than the constant pull toward $0$ end up exactly at $0$, which is why LASSO selects important features (sketched below).
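A sketch of the corresponding LASSO step, written as a proximal (soft-thresholding) update; again the parameter names are illustrative:

```python
import numpy as np

def lasso_gd_step(w, X, y, lr=0.01, lam=0.1):
    """One proximal-gradient (ISTA-style) step for sum_i (w.x_i - y_i)^2 + lam * ||w||_1."""
    grad = 2 * X.T @ (X @ w - y)    # Part 1: same gradient as plain linear regression
    w = w - lr * grad
    # Part 2: pull every coordinate toward 0 by lr*lam, clipping at 0 (soft thresholding)
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
```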
Compressed Sensing
Definition of RIP condition:
Let $W \in \mathbb{R}^{n \times d}$ with $n \ll d$. We say $W$ is $(\epsilon, s)$-RIP if for every $s$-sparse vector $x$ (at most $s$ nonzero coordinates): $(1-\epsilon)\|x\|_2^2 \le \|Wx\|_2^2 \le (1+\epsilon)\|x\|_2^2$.
This is called the Restricted Isometry Property (RIP). It means $W$ approximately preserves the norm of every sparse vector, even though it maps $\mathbb{R}^d$ down to a much lower dimension.
Without the sparsity condition, this is impossible (think about the null space of $W$: since $n < d$, there are nonzero vectors with $Wx = 0$, whose norm is certainly not preserved).
Naïve application of RIP (Theorem 1)
Theorem: Let $W$ be $(\epsilon, 2s)$-RIP with $\epsilon < 1$, let $x$ be $s$-sparse, and let $y = Wx$.
Let $\tilde{x} = \arg\min_{v : Wv = y} \|v\|_0$ be the reconstructed vector. Then $\tilde{x} = x$:
$x$ is the only vector that gives $y$ as the result after applying $W$ under the sparsity condition.
Proof: by contradiction. Note that the minimizer satisfies $\|\tilde{x}\|_0 \le \|x\|_0 \le s$.
If not, i.e., $\tilde{x} \ne x$ with $W\tilde{x} = Wx = y$, then
notice that $x - \tilde{x}$ is a nonzero $2s$-sparse vector with $W(x - \tilde{x}) = 0$, while RIP gives $\|W(x - \tilde{x})\|_2^2 \ge (1-\epsilon)\|x - \tilde{x}\|_2^2 > 0$, which is a contradiction.
This theorem still has the same problem as before: the 0-norm is hard to optimize, so we need to relax it to the 1-norm.
Theorem 2: replace the 0-norm with the 1-norm.
Let $W$ be $(\epsilon, 2s)$-RIP with $\epsilon$ small enough. Then for any $s$-sparse $x$ with $y = Wx$: $x = \arg\min_{v : Wv = y} \|v\|_1$, and this is a convex problem (a linear program) that can be solved efficiently.
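A small sketch of Theorem 2 in action, assuming scipy is available: the 1-norm minimization is written as a linear program over the split $v = v^+ - v^-$.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(W, y):
    """min ||v||_1 s.t. Wv = y, as an LP over the split v = vp - vn with vp, vn >= 0."""
    n, d = W.shape
    c = np.ones(2 * d)                       # objective: sum(vp) + sum(vn) = ||v||_1
    A_eq = np.hstack([W, -W])                # W vp - W vn = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
    return res.x[:d] - res.x[d:]

# Toy check: a 3-sparse vector in dimension 60 from 30 Gaussian measurements.
rng = np.random.default_rng(0)
d, n, s = 60, 30, 3
W = rng.normal(size=(n, d)) / np.sqrt(n)
x = np.zeros(d)
x[rng.choice(d, s, replace=False)] = rng.normal(size=s)
print(np.max(np.abs(basis_pursuit(W, W @ x) - x)))   # typically ~0 when n is large enough
```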
Theorem 2 + x contains noise (Theorem 3)
Let $W$ be $(\epsilon, 2s)$-RIP with $\epsilon$ small enough, and let $x$ be an arbitrary (not necessarily sparse) vector.
Let $x_s$ be the vector that keeps the $s$ largest-magnitude coordinates of $x$ and zeroes out the rest.
That is, $x_s$ is the best $s$-sparse approximation of $x$, and $\|x - x_s\|_1$ measures the "noise".
Let $x^* = \arg\min_{v : Wv = Wx} \|v\|_1$. Then $\|x^* - x\|_2 \le \frac{C}{\sqrt{s}}\,\|x - x_s\|_1$,
where $C$ is a constant that depends only on $\epsilon$ (for exactly $s$-sparse $x$ this recovers Theorem 2).
Proof:
Before the proof, we need to clarify some notations.
Given a vector $v$ and an index set $I$, let $v_I$ denote the vector that agrees with $v$ on $I$ and is zero elsewhere.
Let $h = x^* - x$. We partition the indices as follows: $T_0$ is the set of the $s$ largest-magnitude coordinates of $x$; $T_1$ is the set of the $s$ largest-magnitude coordinates of $h$ outside $T_0$; $T_2$ the next $s$ largest, and so on. Write $T_{0,1} = T_0 \cup T_1$.
Then, based on our notation, $h = h_{T_{0,1}} + h_{T_2} + h_{T_3} + \cdots$.
Now we begin the proof. Our goal is to bound $\|h\|_2$.
We split it into two parts: $\|h\|_2 \le \|h_{T_{0,1}}\|_2 + \|h_{(T_{0,1})^c}\|_2$.
Step 1: bound $\|h_{(T_{0,1})^c}\|_2$. For every $j \ge 2$, each coordinate of $h_{T_j}$ is at most the average magnitude of the coordinates of $h_{T_{j-1}}$, so $\|h_{T_j}\|_2 \le \sqrt{s}\,\|h_{T_j}\|_\infty \le \|h_{T_{j-1}}\|_1/\sqrt{s}$.
The first inequality is by definition of the infinity norm. Summing over $j \ge 2$: $\|h_{(T_{0,1})^c}\|_2 \le \sum_{j\ge 2}\|h_{T_j}\|_2 \le \|h_{T_0^c}\|_1/\sqrt{s}$.
We want to show (*): $\|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1$.
Explanation of (*): for any coordinates $a, b$ we have $|a + b| \ge |a| - |b|$. Since $x^*$ is the minimizer, $\|x\|_1 \ge \|x^*\|_1 = \|x + h\|_1 \ge \|x_{T_0}\|_1 - \|h_{T_0}\|_1 + \|h_{T_0^c}\|_1 - \|x_{T_0^c}\|_1$; rearranging gives (*).
Then we want to bound $\|h_{T_0}\|_1$: by Cauchy–Schwarz, $\|h_{T_0}\|_1 \le \sqrt{s}\,\|h_{T_0}\|_2 \le \sqrt{s}\,\|h_{T_{0,1}}\|_2$.
So: $\|h_{(T_{0,1})^c}\|_2 \le \|h_{T_{0,1}}\|_2 + \frac{2}{\sqrt{s}}\|x_{T_0^c}\|_1$.
We keep the bound in this form because we will bound $\|h_{T_{0,1}}\|_2$ itself in a moment (Step 2).
To summarize, after Step 1 we have obtained: $\|h\|_2 \le \|h_{T_{0,1}}\|_2 + \|h_{(T_{0,1})^c}\|_2 \le 2\|h_{T_{0,1}}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1$ (note that $x_{T_0^c} = x - x_s$).
Step 2: then we want to bound $\|h_{T_{0,1}}\|_2$.
Use the RIP condition to bound the big terms: since $h_{T_{0,1}}$ is $2s$-sparse, $(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \|W h_{T_{0,1}}\|_2^2 = \langle W h_{T_{0,1}},\, W h \rangle - \sum_{j\ge 2}\langle W h_{T_{0,1}},\, W h_{T_j}\rangle = -\sum_{j\ge 2}\langle W h_{T_{0,1}},\, W h_{T_j}\rangle$.
The last equality is because $Wh = W(x^* - x) = y - y = 0$.
The final expression consists of inner products between vectors supported on disjoint index sets. We use a very useful lemma:
Lemma: Let $u, v$ be vectors supported on disjoint index sets, each of size at most $s$, and let $W$ be $(\epsilon, 2s)$-RIP. Then $|\langle Wu, Wv\rangle| \le \epsilon\,\|u\|_2\|v\|_2$.
Proof of this small lemma:
WLOG assume $\|u\|_2 = \|v\|_2 = 1$. Then $\langle Wu, Wv\rangle = \frac{1}{4}\left(\|W(u+v)\|_2^2 - \|W(u-v)\|_2^2\right)$.
We can bound these two terms by the RIP condition: $u \pm v$ is $2s$-sparse with $\|u \pm v\|_2^2 = 2$, so $2(1-\epsilon) \le \|W(u \pm v)\|_2^2 \le 2(1+\epsilon)$.
Since both terms lie in the same interval of length $4\epsilon$, their difference is at most $4\epsilon$, hence $|\langle Wu, Wv\rangle| \le \epsilon$.
Now we come back to the original proof.
Therefore, we have for every $j \ge 2$ and $i \in \{0,1\}$: $|\langle W h_{T_i}, W h_{T_j}\rangle| \le \epsilon\,\|h_{T_i}\|_2\|h_{T_j}\|_2$.
Continue: $(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \epsilon\,(\|h_{T_0}\|_2 + \|h_{T_1}\|_2)\sum_{j\ge 2}\|h_{T_j}\|_2 \le \sqrt{2}\,\epsilon\,\|h_{T_{0,1}}\|_2\sum_{j\ge 2}\|h_{T_j}\|_2$.
Continue: let $\rho = \sqrt{2}\,\epsilon/(1-\epsilon)$, which is less than $1$ when $\epsilon$ is small enough. Dividing by $(1-\epsilon)\|h_{T_{0,1}}\|_2$ and using Step 1: $\|h_{T_{0,1}}\|_2 \le \rho\sum_{j\ge 2}\|h_{T_j}\|_2 \le \rho\left(\|h_{T_{0,1}}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1\right)$, so $\|h_{T_{0,1}}\|_2 \le \frac{2\rho}{1-\rho}\cdot\frac{\|x - x_s\|_1}{\sqrt{s}}$.
Then we plug this back into the bound from Step 1: $\|h\|_2 \le 2\|h_{T_{0,1}}\|_2 + \frac{2}{\sqrt{s}}\|x - x_s\|_1 \le \frac{2(1+\rho)}{1-\rho}\cdot\frac{\|x - x_s\|_1}{\sqrt{s}}$.
This completes the proof.
How to construct a RIP matrix?
Random matrix
Theorem 4: Let $U$ be an arbitrary fixed $d \times d$ orthonormal matrix and let $\epsilon, \delta \in (0,1)$.
Let $n$ be sufficiently large (on the order of $\frac{s\log(d/(\epsilon\delta))}{\epsilon^2}$), and let $W \in \mathbb{R}^{n\times d}$ be a random matrix with i.i.d. entries drawn from $N(0, 1/n)$. Then, with probability at least $1-\delta$, the matrix $WU$ is $(\epsilon, s)$-RIP.
Explanation:
- The matrix $U$ can be the identity, so $W$ itself is also $(\epsilon, s)$-RIP.
- Before we delve deeper into this proof, we discuss a question: why is the extra matrix $U$ useful? $W$ alone is RIP, but the input we want to recover may not be sparse in the standard basis. However, $WU$ is still RIP for any orthonormal $U$, so if the input is sparse in the basis given by $U$ (i.e., $x = U\alpha$ with $\alpha$ sparse), we can still measure and recover it. In other words, we can insert an arbitrary orthonormal matrix to turn non-sparse inputs into sparse ones (an empirical sketch follows below).
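A quick empirical sanity check of this construction (a sketch, not a proof; the sizes are arbitrary): a matrix with i.i.d. $N(0, 1/n)$ entries approximately preserves the norm of random sparse vectors.

```python
import numpy as np

# Empirical sanity check (not a proof): a random matrix with i.i.d. N(0, 1/n) entries
# approximately preserves the squared norm of sparse vectors.
rng = np.random.default_rng(0)
d, n, s = 1000, 200, 10
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, d))

for _ in range(5):
    x = np.zeros(d)
    x[rng.choice(d, s, replace=False)] = rng.normal(size=s)
    print(np.linalg.norm(W @ x) ** 2 / np.linalg.norm(x) ** 2)   # close to 1
```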
The rough idea of this proof is:
The RIP condition must hold for all possible (infinitely many) sparse vectors, but we want to apply a union bound over finitely many events. So:
- Continuous space → a finite number of points (a covering net).
- Consider a specific index set of size $s$ and establish the RIP inequality on that specific index set.
- Use this index set to enter the sparse space: extend the bound from the finite net to every vector supported on that index set.
- Apply the union bound over all possible index sets of size $s$.
Let's begin our proof:
Lemma 2:
Let $\nu \in (0,1)$. There exists a finite set $Q \subset \mathbb{R}^s$ with $|Q| \le (3/\nu)^s$ such that every point of the unit ball $\{x \in \mathbb{R}^s : \|x\|_2 \le 1\}$ is within distance $\nu$ of some point of $Q$.
Meaning:
We can cover a high-dimensional space with finitely many points (here "cover" means that any point of the space has a small distance to its nearest point in our finite set). This corresponds to the first step of our idea: "continuous space → finite number of points".
Proof:
Take $Q$ to be a maximal subset of the unit ball in which any two points are more than $\nu$ apart (add points greedily until no more can be added).
That is, every point of the unit ball is within distance $\nu$ of $Q$, otherwise it could still be added.
Clearly, the balls of radius $\nu/2$ centered at the points of $Q$ are disjoint and contained in the ball of radius $1 + \nu/2$.
Comparing volumes, $|Q| \le \left(\frac{1+\nu/2}{\nu/2}\right)^s \le (3/\nu)^s$, which is the bound we wanted.
JL lemma:
Let $Q$ be a finite set of vectors in $\mathbb{R}^d$, let $\delta, \epsilon \in (0,1)$, and let $W \in \mathbb{R}^{n\times d}$ be a random matrix with i.i.d. $N(0, 1/n)$ entries, where $n$ is on the order of $\frac{\log(|Q|/\delta)}{\epsilon^2}$.
Then, with probability of at least $1-\delta$: for every $x \in Q$, $(1-\epsilon)\|x\|_2^2 \le \|Wx\|_2^2 \le (1+\epsilon)\|x\|_2^2$.
Meaning:
- This theorem is already almost identical to the RIP condition we ultimately want to prove; the only difference is that here $x$ is not an arbitrary sparse vector but ranges over a finite set. This corresponds to the second step of our idea: "consider a specific index set of size $s$".
- The required $n$ no longer depends on the ambient dimension $d$; it only depends on the size $|Q|$ of the finite set. So we can use a small $Q$ and a very low-dimensional $n\times d$ matrix $W$.
Proof: omitted here.
Lemma 3:
Let $I \subseteq [d]$ be a fixed index set of size $s$, and let $W$ be the random matrix above with $n$ sufficiently large (roughly $\frac{s + \log(1/\delta)}{\epsilon^2}$ up to logarithmic factors).
Then, with probability of at least $1-\delta$: for every vector $x$ supported on $I$, $(1-\epsilon)\|x\|_2^2 \le \|Wx\|_2^2 \le (1+\epsilon)\|x\|_2^2$.
Meaning:
Lemma 3 is already very close to what we ultimately want to prove; it corresponds to the third step of our idea: "use this index set to enter the sparse space".
First, an explanation:
If Lemma 3 is true, we know that for any fixed index set $I$ of size $s$, with high probability $W$ approximately preserves the norm of every vector supported on $I$.
This implies the RIP inequality on that index set; by homogeneity it suffices to check it for unit vectors $a$ supported on $I$.
Since $a$ has unit length, we only need to bound $\|Wa\|_2^2$ to lie in $[1-\epsilon, 1+\epsilon]$.
Once Lemma 3 is correct, we do the fourth step, "apply the union bound over all possible index sets of size $s$": there are $\binom{d}{s} \le d^s$ of them, so replacing $\delta$ by $\delta/d^s$ in Lemma 3 yields the RIP condition for all $s$-sparse vectors simultaneously, at the cost of an extra $s\log d$ term in the requirement on $n$.
Proof:
The rough idea is already clear: first use the JL lemma to control the discrete set of points (the net), then extend the bound from the net to every unit vector supported on $I$.
It suffices to prove the lemma for all unit vectors $x$ supported on $I$; the general case follows by rescaling.
We can identify these vectors with the unit ball of $\mathbb{R}^s$, so Lemma 2 applies.
Using Lemma 2 we know that there exists a set $Q$ with $|Q| \le (3/\nu)^s$ that covers this unit ball with radius $\nu$.
Since every unit vector $x$ supported on $I$ has some $v \in Q$ with $\|x - v\|_2 \le \nu$,
that is, $x = v + u$ with $u$ supported on $I$ and $\|u\|_2 \le \nu$.
Apply the JL lemma on the finite set $Q$: with probability at least $1-\delta$, for every $v \in Q$, $(1-\epsilon_0)\|v\|_2^2 \le \|Wv\|_2^2 \le (1+\epsilon_0)\|v\|_2^2$.
We then have $\|Wv\|_2 \le 1 + \epsilon_0$ for every $v \in Q$.
This also implies the lower bound $\|Wv\|_2 \ge 1 - \epsilon_0$.
Let $a = \sup\{\|Wx\|_2 : x \text{ is a unit vector supported on } I\}$.
Notice that for any such $x$, writing $x = v + u$ as above, $\|Wx\|_2 \le \|Wv\|_2 + \|Wu\|_2 \le (1+\epsilon_0) + a\nu$, because $u/\|u\|_2$ is again a unit vector supported on $I$.
By definition of $a$, taking the supremum over $x$ gives $a \le (1+\epsilon_0) + a\nu$, i.e., $a \le \frac{1+\epsilon_0}{1-\nu}$; a symmetric argument gives a matching lower bound, and choosing $\epsilon_0$ and $\nu$ small enough turns these into the desired $(1\pm\epsilon)$ bounds on $\|Wx\|_2^2$.
There may be some problem with the left half in PPT. TODO
There may be a gap between Lemma 3 and the theorem, but the instructor did not cover it in class.
The homework also mentions constructing a RIP matrix from a random orthonormal basis; this will appear again later in the decision tree section.
Support Vector Machine
Concept explanation
A support vector machine (SVM) is a binary classification model. Its basic form is the linear classifier that maximizes the margin in feature space.
Margin: distance from the separator to the closest point
Samples on the margin are "support vectors".
Mathematical formulation:
Find a hyperplane $w^\top x + b = 0$ that maximizes the margin. After normalizing so that the closest points satisfy $y_i(w^\top x_i + b) = 1$, this becomes $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^\top x_i + b) \ge 1$ for all $i$.
If the data is perfectly linearly separable, this can be solved by quadratic programming in polynomial time.
What if the data is not linearly separable?
Naïve answer: count the violated constraints with the 0-1 loss,
i.e., minimize $\frac{1}{2}\|w\|^2 + C\sum_i \mathbb{1}\!\left[y_i(w^\top x_i + b) < 1\right]$.
This is NP-hard to minimize.
The indicator function is hard to optimize. Make it soft.
Relax the hard constraint by allowing slack: $\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
"Violating the constraint" a little bit is OK.
This is called SVM with a "soft" margin.
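A sketch of training this soft-margin objective directly in its unconstrained hinge-loss form with (sub)gradient descent, rather than the QP/dual route described next; names and step sizes are illustrative:

```python
import numpy as np

def svm_hinge_gd(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin SVM in unconstrained form:
       min 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b)); labels y must be +/-1."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                                   # margin violators
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```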
Computation
Then we use the dual to solve the SVM:
The linearly separable case:
Primal: $\min_{w,b} \frac{1}{2}\|w\|^2$ s.t. $y_i(w^\top x_i + b) \ge 1$ for all $i$.
Dual: $\max_{\alpha \ge 0}\min_{w,b} L(w,b,\alpha)$, where the Lagrangian is $L(w,b,\alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i\left[y_i(w^\top x_i + b) - 1\right]$.
Take the derivative: $\nabla_w L = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$, and $\partial L/\partial b = 0 \Rightarrow \sum_i \alpha_i y_i = 0$.
Plug it into $L$: $\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.
The relaxed case
Primal: $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_i \xi_i$ s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$.
The Lagrangian: $L = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(w^\top x_i + b) - 1 + \xi_i\right] - \sum_i \mu_i \xi_i$,
where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are the Lagrange multipliers.
Take derivatives, and we get the optimality conditions: $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $C - \alpha_i - \mu_i = 0$.
So, $0 \le \alpha_i \le C$.
Then we put this solution back into $L$ and obtain the dual: $\max_\alpha\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j$ s.t. $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$.
The last formulation is the same as in the linearly separable case, except that each $\alpha_i$ is now capped at $C$.
kernel trick
We can use a feature map to transform non-linearly separable data into a high-dimensional space in which it becomes linearly separable.
Two problems:
- The feature map $\phi(x)$ behind the kernel may have a very high (even infinite) dimension, so it is hard to compute.
- The separator lives in that high-dimensional space, so it is also hard to compute and store.
For training, the dual only involves inner products $x_i^\top x_j$, which we replace by kernel values $K(x_i, x_j) = \phi(x_i)^\top\phi(x_j)$:
- We only need to do $O(m^2)$ kernel computations for $m$ training samples.
When a new data point $x$ arrives, we predict with $\mathrm{sign}\!\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$.
Assume there are $k$ support vectors (the points with $\alpha_i > 0$); then we only need to do at most $k$ kernel computations per prediction.
Therefore, there is no need to compute $\phi(x)$ explicitly at all.
Moving a step further: do we even need to define $\phi$ explicitly, or can we start directly from a kernel function $K$?
Mercer's Theorem: if the kernel matrix $\left[K(x_i, x_j)\right]_{i,j}$ is positive semidefinite for any finite set of points, then there exists a feature map $\phi$ with $K(x, x') = \phi(x)^\top\phi(x')$.
That means we may define a kernel directly (e.g., the RBF kernel) without ever writing down $\phi$.
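A small sketch of prediction with a kernel expansion, using the RBF kernel as the example; it assumes the dual coefficients of the support vectors have already been obtained somehow:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_predict(x_new, X_sv, y_sv, alpha_sv, b, gamma=1.0):
    """sign(sum_i alpha_i * y_i * K(x_i, x_new) + b), using support vectors only."""
    k = rbf_kernel(X_sv, x_new[None, :], gamma).ravel()   # one kernel value per support vector
    return np.sign(alpha_sv @ (y_sv * k) + b)
```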
Decision Tree
Pro: good interpretability.
Con: trees tend to grow big and deep, they are hard to tweak, and they overfit easily.
Prerequisites
Boolean function analysis:
Fourier basis of Boolean functions: for $x \in \{-1,+1\}^n$ and $S \subseteq [n]$, define $\chi_S(x) = \prod_{i\in S} x_i$; every $f: \{-1,+1\}^n \to \mathbb{R}$ can be written uniquely as $f(x) = \sum_S \hat{f}(S)\,\chi_S(x)$ with $\hat{f}(S) = \mathbb{E}_x[f(x)\chi_S(x)]$.
A low degree means the degree of all its terms is bounded by this number; that is, $\hat{f}(S) = 0$ whenever $|S|$ exceeds the bound.
Convert decision tree to low-degree sparse function
Thm: for any decision tree with $m$ leaves and any $\epsilon > 0$, there is a function that is low-degree (degree $O(\log(m/\epsilon))$) and sparse (Fourier $L_1$ norm at most $m$, and only polynomially many nonzero coefficients) that approximates the tree up to error $\epsilon$.
Proof:
Step 1: bound its depth. Truncate the tree at depth $\log_2(m/\epsilon)$.
This means there are at most $m$ truncated leaves, and each of them is reached by a uniformly random input with probability at most $2^{-\log_2(m/\epsilon)} = \epsilon/m$.
It differs by at most $m \cdot \epsilon/m = \epsilon$: the truncated tree disagrees with the original tree on at most an $\epsilon$ fraction of inputs.
(Note: this step already contributes an $\epsilon$ error of its own.)
So below we assume the tree has depth at most $\log_2(m/\epsilon)$.
Step 2: bound its degree and L1 norm.
A tree with depth $t$ and $m$ leaves can be written exactly as a sum over leaves of AND terms (the indicator of the root-to-leaf path times the leaf's label), each of degree at most $t$.
Why is this? Suppose we want to reach one particular leaf node along a specific root-to-leaf path.
Each leaf can be represented this way: the indicator of following its path is the product of factors $\frac{1 \pm x_i}{2}$, one per variable queried along the path.
Expanding this product gives the Fourier coefficients of each term explicitly.
Because every AND term has $L_1$ norm exactly $1$ (expanding $\prod_i \frac{1\pm x_i}{2}$ gives $2^t$ coefficients of magnitude $2^{-t}$), the whole tree, being a sum of at most $m$ such terms, has Fourier $L_1$ norm at most $m$ and degree at most $t = \log_2(m/\epsilon)$.
Step 3
We need to prove: the low-degree function from Step 2 can be made sparse with only a small additional error.
For the function $g$ from Step 2 (with $\sum_S|\hat g(S)| \le m$), keep only the coefficients with $|\hat g(S)| \ge \epsilon/m$.
Proof:
Let us count the kept terms.
There are at most $\frac{m}{\epsilon/m} = m^2/\epsilon$ such terms, because the total $L_1$ mass is at most $m$.
By the Parseval identity, the discarded terms contribute at most $\sum_{S\ \text{dropped}} \hat g(S)^2 \le \frac{\epsilon}{m}\sum_S |\hat g(S)| \le \epsilon$ to the squared $L_2$ error.
So the error requirement is also satisfied.
Combining the two errors (truncation and coefficient dropping) completes the proof.
Good: now we have reduced learning a decision tree to learning a function that is low-degree and sparse in the Boolean Fourier basis. How do we find the coefficients of this function in that basis?
Theoretical analysis
KM algorithm: (Not required)
Key point: recursively prune the less promising sets of basis functions and explore the promising ones.
As an example, the KM algorithm organizes the Fourier coefficients into a binary tree of prefixes: a node labeled by a prefix $a \in \{0,1\}^k$ stands for all sets $S$ whose indicator vector starts with $a$, and $f_a$ denotes the part of $f$ built from exactly those coefficients.
All these functions $f_a$ are well defined and also satisfy the Parseval identity.
E.g., the root corresponds to $f$ itself, and refining the prefix by one bit splits the coefficient mass into two buckets.
Algorithm process:
def Coef(a):   # a is a prefix in {0,1}^k
    If the estimated coefficient mass E_x[f_a(x)^2] < theta^2: return   # prune: no single large coefficient can hide here
    If len(a) == n: output a   # a full-length index whose coefficient survived every threshold
    Else: Coef(a0); Coef(a1)   # recurse on both extensions // feels pretty brute-force qwq
Roughly speaking, we only keep the prefixes whose coefficient mass exceeds the threshold, extending them as far as possible and pruning everything else.
Of course, an obvious question here: how do we actually compute (or estimate) the coefficient mass $\mathbb{E}_x[f_a(x)^2]$?
Lemma 3.2:
For any function $f$ and prefix $a \in \{0,1\}^k$: $\mathbb{E}_x\!\left[f_a(x)^2\right] = \mathbb{E}_{x, y, y'}\!\left[f(y\, x)\, f(y'\, x)\, \chi_a(y)\, \chi_a(y')\right]$, where $y, y'$ are uniform over $\{-1,+1\}^k$ and $x$ is uniform over the remaining coordinates.
Roughly speaking, apart from query access to $f$ itself, everything in this expression is easy to compute, so we can estimate it by random sampling.
This formulation implies that even though we do not know how to compute $f_a$ directly, we can estimate $\mathbb{E}_x[f_a(x)^2]$ by sampling $f$.
This algorithm has two problems:
- Pretty slow
- Sequential algorithm, cannot be done in parallel
LMN Algorithm:
For every set $S$ with $|S| \le d$ (the degree bound), estimate $\hat f(S) = \mathbb{E}_x\!\left[f(x)\chi_S(x)\right]$ by random sampling.
Do it for all such $S$ (there are at most $n^d$ of them) and output $g(x) = \sum_S \hat f(S)\chi_S(x)$.
The sample complexity is small and the algorithm is parallelizable (a sketch follows below).
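A sketch of the LMN idea (function and parameter names are hypothetical): estimate every low-degree Fourier coefficient by sampling.

```python
import numpy as np
from itertools import combinations

def chi(S, X):
    """Parity basis function chi_S(x) = prod_{i in S} x_i, for rows x in {-1,+1}^n."""
    return X[:, list(S)].prod(axis=1) if S else np.ones(len(X))

def lmn_estimate(f, n, degree, n_samples=20000, seed=0):
    """Estimate hat{f}(S) = E[f(x) * chi_S(x)] for every |S| <= degree by sampling."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(n_samples, n))
    fx = np.array([f(x) for x in X])
    return {S: float(np.mean(fx * chi(S, X)))
            for k in range(degree + 1) for S in combinations(range(n), k)}

# Toy usage: for f(x) = x_0 * x_2 the only large coefficient should be on S = (0, 2).
est = lmn_estimate(lambda x: x[0] * x[2], n=4, degree=2)
print(round(est[(0, 2)], 2))   # ~1.0
```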
Two problems:
- does not work well in practice
- does not have guarantees in the noisy setting.
From the point of view of compressed sensing, we can control the error, because the Fourier basis is orthonormal: random evaluations of $f$ act like measurements of its (sparse) coefficient vector.
Harmonica: compressed sensing
This Boolean function analysis gives an orthonormal (random) measurement matrix, so we can run compressed sensing to recover the sparse coefficient vector.
In practice, how do we build a decision tree? The Gini index.
At a specific node, we have a set of candidate variables to split on, and we need to decide which one to use next.
We use the Gini index to measure the uncertainty of a variable relative to the decision outcome.
We pick the variable with the smallest Gini index/uncertainty.
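A minimal sketch of the Gini computation and of picking the split variable with the smallest weighted Gini index (binary features assumed for simplicity):

```python
import numpy as np

def gini(labels):
    """Gini index 1 - sum_k p_k^2: 0 means a pure (certain) node."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split_variable(X, y):
    """Pick the binary variable whose split has the smallest weighted Gini index."""
    best_j, best_score = None, np.inf
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_j, best_score = j, score
    return best_j
```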
Theoretically, if the decision tree is super deep, it can fit anything, but it will overfit. How should we avoid that? Random forests.
Random forest (one of the best tools in practice)
Bagging
- Bagging is the best-known representative of parallel ensemble learning methods. It is directly based on bootstrap sampling.
Random forest is an extended variant of bagging.
According to the slides: sample $n$ times with replacement from the original training set to build a set of the same size that may contain duplicates; this is effectively a weighted subset of the data. Then run the decision tree construction algorithm on it.
Repeat this $B$ times to get $B$ trees, then average their predictions.
This way, the resulting ensemble is more stable than a single tree.
This is only data bagging. We can also do feature bagging: exclude some features when constructing each tree, which works even better (see the sketch after the list below).
• Each tree can only use a random subset of features.
• So the decision tree is forced to use all possible information to predict, instead of relying on a few powerful features.
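A sketch of data bagging plus feature bagging, assuming scikit-learn's `DecisionTreeClassifier` is available as the base tree learner and that labels are $\pm 1$:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, B=50, seed=0):
    """B trees: each trained on a bootstrap sample (data bagging) and restricted to a
       random subset of features at every split (feature bagging via max_features)."""
    rng = np.random.default_rng(seed)
    trees = []
    for b in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # sample n times with replacement
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    """Average the B trees (majority vote for +/-1 labels)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (B, n_samples)
    return np.sign(votes.mean(axis=0))
```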
Boosting
We may have lots of weak learners. Can we combine them into a strong learner?
That is, we want to learn a weighted combination of the weak learners: $H(x) = \mathrm{sign}\!\left(\sum_t \alpha_t h_t(x)\right)$.
Adaboost
Key idea: we can adjust the weight of each sample to control what the next weak learner focuses on.
Construction:
Init: $D_1(i) = 1/m$ for every training sample $i$.
Given $D_t$: train a weak learner $h_t$ on the weighted sample, compute its weighted error $\epsilon_t = \Pr_{i\sim D_t}[h_t(x_i)\ne y_i]$, set $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and update $D_{t+1}(i) \propto D_t(i)\exp\!\left(-\alpha_t y_i h_t(x_i)\right)$ (normalized by $Z_t$); a sketch of this loop follows below.
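A sketch of the AdaBoost loop; the `weak_learner(X, y, D)` interface is an assumption made for this example, not part of the notes:

```python
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    """AdaBoost for labels y in {-1,+1}. `weak_learner(X, y, D)` must return a trained
       classifier h with h(X) in {-1,+1}, fitted to the weighted sample (weights D)."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # init: uniform sample weights
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = max(D[pred != y].sum(), 1e-12)   # weighted training error
        if eps >= 0.5:                         # no edge left, stop early
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)      # up-weight the current mistakes
        D = D / D.sum()
        hs.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```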
Convergence Thm: the training error of the final classifier $H$ satisfies $\frac{1}{m}\sum_i \mathbb{1}[H(x_i)\ne y_i] \le \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} \le \exp\!\left(-2\sum_t \gamma_t^2\right)$, where $\gamma_t = \frac{1}{2} - \epsilon_t$ is the edge of the $t$-th weak learner.
AdaBoost is adaptive: it does not need to know the edges $\gamma_t$ in advance.
Convergence Proof
Step 1: unwrap the recurrence.
Let $f(x) = \sum_t \alpha_t h_t(x)$. Unwrapping the weight update gives $D_{T+1}(i) = \frac{\exp(-y_i f(x_i))}{m\prod_t Z_t}$, where $Z_t$ is the normalization factor at round $t$.
Step 2: bound the error of $H$. Since $\mathbb{1}[H(x_i)\ne y_i] \le \exp(-y_i f(x_i))$, the training error is at most $\frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \prod_t Z_t \sum_i D_{T+1}(i) = \prod_t Z_t$.
Step 3: bound $Z_t$.
We have $Z_t = \sum_i D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t}$; taking the derivative in $\alpha_t$ and setting it to zero gives $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$.
Because $2\sqrt{\epsilon_t(1-\epsilon_t)} = \sqrt{1-4\gamma_t^2} \le \exp(-2\gamma_t^2)$, multiplying over $t$ gives the theorem.
From this theorem, we know that when every weak learner has an edge $\gamma_t \ge \gamma > 0$, the training error drops below $1/m$ (and hence becomes zero) after $T = O\!\left(\frac{\log m}{\gamma^2}\right)$ rounds.
Margin-based analysis
Thm 1: Let $\mathcal{H}$ be a finite class of base hypotheses $h: \mathcal{X} \to \{-1,+1\}$, let $D$ be the data distribution, and let $S$ be a sample of $N$ i.i.d. examples from $D$.
Define the convex hull $\mathcal{C} = \left\{ f = \sum_j a_j h_j : a_j \ge 0,\ \sum_j a_j = 1,\ h_j \in \mathcal{H} \right\}$.
We allow the same $h$ to appear multiple times in the combination.
Any majority-vote hypothesis $f \in \mathcal{C}$ satisfies, with probability at least $1-\delta$ over the sample, for every $\theta > 0$: $\Pr_D[y f(x) \le 0] \le \Pr_S[y f(x) \le \theta] + O\!\left(\sqrt{\frac{\log N\,\log|\mathcal{H}|}{N\theta^2} + \frac{\log(1/\delta)}{N}}\right)$.
The idea here is quite clever: for each function $f \in \mathcal{C}$, we construct a probability distribution over a set of simpler functions whose average equals $f$.
How do we construct this distribution? Since the weights $a_j$ form a probability distribution over $\mathcal{H}$, we sample $k$ base hypotheses i.i.d. according to these weights and take their unweighted average $g(x) = \frac{1}{k}\sum_{i=1}^{k} h_{j_i}(x)$; by construction $\mathbb{E}_g[g(x)] = f(x)$.
Our goal is to upper bound the generalization error of $f$, i.e., $\Pr_D[y f(x) \le 0]$. We insert the sampled approximation $g$ and split: $\Pr_D[yf(x) \le 0] \le \Pr_{D,g}[y g(x) \le \theta/2] + \Pr_{D,g}[y g(x) > \theta/2,\ y f(x) \le 0]$.
As this inequality holds for every realization of $g$, it also holds in expectation over the random choice of $g$.
We first look at the second term: it asks for $g$ to deviate from its mean $f$ by more than $\theta/2$ at a point where $y f(x) \le 0$.
Since $y g(x)$ is an average of $k$ i.i.d. $\pm 1$ terms with mean $y f(x) \le 0$, the Chernoff–Hoeffding bound gives $\Pr_g[y g(x) > \theta/2 \mid y f(x) \le 0] \le e^{-k\theta^2/8}$.
Then we come to the first term: $\Pr_{D,g}[y g(x) \le \theta/2]$.
Changing the statement slightly: with probability at least $1-\delta$ over the sample $S$, we want $\Pr_D[y g(x) \le \theta/2] \le \Pr_S[y g(x) \le \theta/2] + \varepsilon_N$ to hold simultaneously for every $g$ of the above form and every $\theta$.
What we actually want to bound, given the sample $S$, is therefore the gap between the true and empirical probabilities of the event $\{y g(x) \le \theta/2\}$, uniformly over $g$ and $\theta$.
To upper bound the first term, we use the union bound.
For any single term (a fixed $g$ and a fixed $\theta$), the empirical probability of the event concentrates around its true probability.
How do we compute the probability that this holds for a single term? By the Chernoff–Hoeffding bound.
The quantity in the expression above is an empirical average of i.i.d. indicator variables over the $N$ samples, so the probability that it deviates from its mean by more than $\varepsilon_N$ is at most $e^{-2N\varepsilon_N^2}$.
Then we enumerate all possible $g$: there are at most $|\mathcal{H}|^k$ of them.
Next we also take a union bound over $\theta$: since $y g(x)$ only takes values on a grid of size $k+1$, only $k+1$ distinct thresholds matter.
The total failure probability of this union bound should be less than the budget $\delta_k$ we allocate to this value of $k$.
To guarantee this, since there are at most $|\mathcal{H}|^k (k+1)$ events in the union, it suffices to take $\varepsilon_N = \sqrt{\frac{\ln\!\left(|\mathcal{H}|^k (k+1)/\delta_k\right)}{2N}}$.
Similarly to the earlier step, we also relate the empirical term back to the margin of $f$ itself:
we have, on the sample side, $\Pr_S[y g(x) \le \theta/2] \le \Pr_S[y f(x) \le \theta] + \Pr_{S,g}[y g(x) \le \theta/2,\ y f(x) > \theta]$.
By the Chernoff–Hoeffding bound, the second probability is again at most $e^{-k\theta^2/8}$.
Combining these together, we get: for every $f \in \mathcal{C}$ and every $\theta > 0$, $\Pr_D[y f(x) \le 0] \le \Pr_S[y f(x) \le \theta] + 2e^{-k\theta^2/8} + \varepsilon_N$.
Substituting the value of $\varepsilon_N$ and choosing $k$ on the order of $\frac{1}{\theta^2}\log N$ balances the two error terms.
Finally, distributing the failure probability over all values of $k$ (e.g., $\delta_k = \delta/(k(k+1))$) and collecting terms gives the bound stated in the theorem.
Modification on Adaboost
Thm 2:
Suppose the base learning algorithm, when called by AdaBoost, generates hypotheses with weighted training errors $\epsilon_1,\dots,\epsilon_T$. Then for any $\theta$, the fraction of training examples with margin at most $\theta$ satisfies $\Pr_S[y f(x) \le \theta] \le \prod_{t=1}^{T} 2\sqrt{\epsilon_t^{\,1-\theta}(1-\epsilon_t)^{\,1+\theta}}$, where $f(x) = \frac{\sum_t \alpha_t h_t(x)}{\sum_t \alpha_t}$.
Proof:
Note that if $y f(x) \le \theta$, then $\exp\!\left(-y\sum_t\alpha_t h_t(x) + \theta\sum_t\alpha_t\right) \ge 1$.
So: $\Pr_S[y f(x)\le\theta] \le \frac{1}{m}\sum_i \exp\!\left(-y_i\sum_t\alpha_t h_t(x_i)\right)e^{\theta\sum_t\alpha_t} = e^{\theta\sum_t\alpha_t}\prod_t Z_t$.
Because $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ and $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$, we get $e^{\theta\alpha_t}Z_t = 2\sqrt{\epsilon_t^{\,1-\theta}(1-\epsilon_t)^{\,1+\theta}}$, which gives the claimed bound.
When every $\epsilon_t \le \frac{1}{2}-\gamma$ and $\theta$ is small enough, each factor is strictly less than $1$, so this also decreases exponentially in $T$.
Gradient Boosting
One drawback of AdaBoost: it only handles binary classification tasks.
What about regression tasks? We want to extend AdaBoost to them; doing so gives gradient boosting.
Recall that AdaBoost is optimizing the following function (the exponential loss): $L(\alpha) = \frac{1}{m}\sum_i \exp\!\left(-y_i\sum_t \alpha_t h_t(x_i)\right)$.
- This will go to 0 as T goes up
- How does it optimize? It uses coordinate descent
Our goal: given a set of weak learners, learn a weight for each of them.
Computing the full gradient for gradient descent is too expensive, because the number of weak learners can be huge.
So each time we fix all the other weights and update only one of them; this is coordinate descent.
Process:
- Initially all weights are $0$.
- In each iteration of AdaBoost, we pick a coordinate (a weak learner) and set its weight to make the loss decrease the most.
- It is like each time we add one weak learner $h_t$ into the current combination $f$.
Because each step moves along the single coordinate that decreases the loss the most, AdaBoost is exactly coordinate descent on the exponential loss.
How is it related to gradient descent?
Suppose the loss we want to optimize is $L(f) = \sum_i \ell(f(x_i), y_i)$ for a general differentiable loss $\ell$.
If we did gradient descent, we would move each prediction $f(x_i)$ a small step along the negative gradient $-\partial\ell/\partial f(x_i)$ (for the squared loss this is just the residual $y_i - f(x_i)$).
But in practice we cannot update the predictions directly; instead, each round we fit a new weak learner to these negative-gradient values and add it (with a small step size) to the ensemble.
More generally, other losses work the same way: fit each new weak learner to the negative gradient of the loss.
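A sketch of gradient boosting for the squared loss, using small regression trees from scikit-learn as the assumed weak learners; each round fits the residual (the negative gradient):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting(X, y, T=100, lr=0.1, max_depth=2):
    """Gradient boosting for the squared loss: each round fits a small regression tree
       to the negative gradient of the loss, which here is simply the residual y - f."""
    f = np.zeros(len(y))                       # current ensemble prediction on the training set
    trees = []
    for _ in range(T):
        residual = y - f                       # negative gradient of 0.5*(f - y)^2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + lr * tree.predict(X)           # small step along the new weak learner
        trees.append(tree)
    return lambda Xq: lr * sum(t.predict(Xq) for t in trees)
```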