Chapter 3: Generalization Theory
Acknowledgment: Most of the material comes from Yuan Yang's course "Machine Learning"; the theorem numbering follows Shalev-Shwartz and Ben-David, Understanding Machine Learning.
What problem does generalization theory try to solve? Given a hypothesis $h$ learned from a training set $S$ sampled i.i.d. from a distribution $\mathcal{D}$, we want to know its true error $L_{\mathcal{D}}(h)$. There are two ways:
- The traditional way is sampling from $\mathcal{D}$ again to get a test set, and then using the test error as an estimate of $L_{\mathcal{D}}(h)$.
- We can use theory to get generalization bounds.
The latter is what generalization theory wants to do.
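As a contrast with the theoretical route, here is a minimal sketch of the traditional route. The setup is hypothetical, invented only to make the example self-contained: a threshold classifier `h`, a true labeling rule, and a uniform data distribution stand in for a real learned model and a real $\mathcal{D}$; the Hoeffding interval quantifies how reliable the test-set estimate is.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Hypothetical learned classifier: a simple threshold rule.
    return (x > 0.3).astype(int)

def true_label(x):
    # Hypothetical ground-truth labeling rule.
    return (x > 0.5).astype(int)

# "Sampling from D again": draw a fresh test set from the data distribution.
m = 2000
x_test = rng.uniform(0, 1, size=m)
test_error = np.mean(h(x_test) != true_label(x_test))  # estimate of L_D(h)

# Hoeffding: w.p. >= 1 - delta, |L_D(h) - test_error| <= sqrt(ln(2/delta)/(2m)).
delta = 0.05
eps = np.sqrt(np.log(2 / delta) / (2 * m))
print(f"estimated L_D(h) = {test_error:.3f} +/- {eps:.3f} (w.p. {1 - delta:.0%})")
```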
No free lunch theorem
It says that if we don't have any prior knowledge of the data distribution, we can't design an algorithm that always works well. In other words, there is no universal learner. By this theorem, even human brains only work in certain scenarios. But the theorem considers a rather peculiar setting (learning labels with no prior knowledge at all), which may not exist in practice.
Theorem 5.1 (no-free-lunch). Let $A$ be any learning algorithm for binary classification with respect to the 0-1 loss over a domain $\mathcal{X}$, and let $m \le |\mathcal{X}|/2$ (that is, $m$ is the training set size). Then there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ such that:
- There exists a function $f:\mathcal{X}\to\{0,1\}$ with $L_{\mathcal{D}}(f)=0$.
- With probability of at least $1/7$ over the choice of $S\sim\mathcal{D}^m$ we have that $L_{\mathcal{D}}(A(S))\ge 1/8$.
That is, $\mathbb{P}_{S\sim\mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\ge \tfrac{1}{8}\big]\ge \tfrac{1}{7}$.
First, let us clarify these symbols: $A$ is the learning algorithm, $S\sim\mathcal{D}^m$ is the training set of size $m$, $A(S)$ is the hypothesis the algorithm returns after seeing $S$, and $L_{\mathcal{D}}(\cdot)$ is the true 0-1 error under $\mathcal{D}$.
Before the proof, let us first examine the condition $m \le |\mathcal{X}|/2$: take a subset $C\subseteq\mathcal{X}$ with $|C| = 2m$, so that the training set can cover at most half of $C$. There are $T = 2^{2m}$ possible labeling functions $f_1,\dots,f_T$ from $C$ to $\{0,1\}$; that is, for each $f_i$ we define a distribution $\mathcal{D}_i$ that is uniform over $C$ and labels $x$ by $f_i(x)$. Such a $\mathcal{D}_i$ satisfies $L_{\mathcal{D}_i}(f_i)=0$, so the first claim of the theorem holds automatically.
How should we approach the proof? Given the probability form we rewrote above, Markov-type inequalities naturally come to mind: if a random variable $Z$ takes values in $[0,1]$ and $\mathbb{E}[Z]\ge \frac{1}{4}$, then $\mathbb{P}\big[Z \ge \tfrac{1}{8}\big] \ge \frac{\mathbb{E}[Z]-1/8}{1-1/8} \ge \frac{1}{7}$.
Add to this a fairly direct intuition: half of the questions I have seen before, so I can answer them; the other half I can only guess, and in binary classification a guess is correct with probability $1/2$, so the accuracy is at most $3/4$, i.e., the expected error $\mathbb{E}[Z]$ is at least $1/4$.
The key next step is therefore to prove $\max_{i\in[T]} \mathbb{E}_{S\sim\mathcal{D}_i^m}\big[L_{\mathcal{D}_i}(A(S))\big] \ge \frac{1}{4}$.
Since the maximum is at least the average, it suffices to lower bound $\frac{1}{T}\sum_{i=1}^{T} \mathbb{E}_{S\sim\mathcal{D}_i^m}\big[L_{\mathcal{D}_i}(A(S))\big]$, and we may exchange the two averages. For any fixed sequence of training instances $(x_1,\dots,x_m)$ from $C$: originally we averaged over different training sets; now we average over different labelings. Things become simpler, because with the training instances fixed we can easily identify the elements that do not appear in the training set. Denote these elements by $v_1,\dots,v_p$; since $|C| = 2m$, we have $p \ge m$.
So for every unseen element $v_r$: no matter what the learned hypothesis $A(S)$ predicts on $v_r$, exactly half of the labelings $f_i$ disagree with it, so $\frac{1}{T}\sum_{i=1}^{T}\mathbb{1}\big[A(S)(v_r)\neq f_i(v_r)\big]=\frac{1}{2}$. Since at least half of the points of $C$ are unseen, averaging over labelings gives an expected error of at least $\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}$. We don't even need to take a minimum over $r$: every unseen point contributes exactly $1/2$ on average, as the simulation below illustrates.
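A small numerical check of this averaging step, my own sketch rather than anything from the course: the domain size $2m$, the memorizing learner, and its "guess 0 on unseen points" rule are all assumptions chosen for illustration. Averaging the error over all $2^{2m}$ labelings gives exactly $1/4$.

```python
import itertools
import numpy as np

m = 3
C = list(range(2 * m))       # domain C of size 2m
train_idx = set(C[:m])       # fixed training instances
# the remaining elements v_1, ..., v_m never appear in the training set

def memorizing_learner(labels):
    # Memorizes training labels; guesses 0 on unseen points (any fixed rule works).
    def predict(x):
        return labels[x] if x in train_idx else 0
    return predict

errors = []
for f in itertools.product([0, 1], repeat=2 * m):  # all labelings f_1, ..., f_T
    predict = memorizing_learner(f)
    # L_{D_i}: uniform distribution over C, labels given by f
    errors.append(np.mean([predict(x) != f[x] for x in C]))

print(f"average error over all labelings: {np.mean(errors):.3f}  (= 1/4)")
```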
Finite hypothesis class and ERM
The hypothesis class $\mathcal{H}$ is a class of functions, and it can be finite or infinite. If $\mathcal{H}$ is an infinite class, bounds in terms of $|\mathcal{H}|$ are meaningless; we return to that case with the VC dimension and Rademacher complexity below, and start here with finite $\mathcal{H}$.
Realizability assumption
To show the guarantee below, we need the assumption that there exists $h^\star \in \mathcal{H}$ with $L_{(\mathcal{D},f)}(h^\star) = 0$.
Note that this assumption implies that with probability 1 over random samples $S$ (instances drawn from $\mathcal{D}$ and labeled by $f$), we have $L_S(h^\star) = 0$.
Later we will see that it is not always possible to pick a function that perfectly solves the problem.
Corollary 2.3. Let $\mathcal{H}$ be a finite hypothesis class, let $\delta \in (0,1)$ and $\epsilon > 0$, and let $m$ be an integer satisfying $m \ge \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}$. Then, for any labeling function $f$ and any distribution $\mathcal{D}$ for which the realizability assumption holds, with probability at least $1-\delta$ over an i.i.d. sample $S$ of size $m$, every ERM hypothesis $h_S$ satisfies $L_{(\mathcal{D},f)}(h_S) \le \epsilon$.
It means that if the problem is realizable, then the ERM hypothesis is good enough: once the sample size reaches $\log(|\mathcal{H}|/\delta)/\epsilon$, the function learned by ERM is sufficiently good.
Proof:
We would like to bound the probability of the event $L_{(\mathcal{D},f)}(h_S) > \epsilon$.
Let $\mathcal{H}_B = \{h \in \mathcal{H} : L_{(\mathcal{D},f)}(h) > \epsilon\}$ be the set of 'bad' hypotheses.
But, since the realizability assumption implies that $L_S(h_S) = 0$, the event $L_{(\mathcal{D},f)}(h_S) > \epsilon$ can only happen if some bad hypothesis achieves zero empirical error on $S$.
Let $M = \{S|_x : \exists h \in \mathcal{H}_B,\ L_S(h) = 0\}$ be the set of 'misleading' samples, so that $M = \bigcup_{h \in \mathcal{H}_B} \{S|_x : L_S(h) = 0\}$.
Only when the sample falls into $M$ can ERM return a bad hypothesis, hence $\mathbb{P}\big[L_{(\mathcal{D},f)}(h_S) > \epsilon\big] \le \mathbb{P}[M] \le \sum_{h \in \mathcal{H}_B} \mathbb{P}\big[L_S(h) = 0\big]$.
The first inequality holds because if the $h_S$ trained on this $S$ has true error larger than $\epsilon$, then $h_S$ itself is a bad hypothesis with $L_S(h_S) = 0$, so $S|_x \in M$; the second is the union bound.
Now, for a fixed $h \in \mathcal{H}_B$, each i.i.d. example is classified correctly with probability $1 - L_{(\mathcal{D},f)}(h) < 1 - \epsilon$, so $\mathbb{P}\big[L_S(h) = 0\big] = \big(1 - L_{(\mathcal{D},f)}(h)\big)^m \le (1-\epsilon)^m \le e^{-\epsilon m}$.
Finally: $\mathbb{P}\big[L_{(\mathcal{D},f)}(h_S) > \epsilon\big] \le |\mathcal{H}_B|\, e^{-\epsilon m} \le |\mathcal{H}|\, e^{-\epsilon m}$.
To make this at most $\delta$, it suffices to take $m \ge \frac{\log(|\mathcal{H}|/\delta)}{\epsilon}$, which completes the proof; a small simulation of this guarantee follows.
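A quick empirical sanity check of Corollary 2.3 under an assumed setup, not from the notes: a finite grid of threshold classifiers $h_t(x) = \mathbb{1}[x > t]$, a realizable target at $t^\star = 0.5$, and uniform data on $[0,1]$.

```python
import numpy as np

rng = np.random.default_rng(1)

thresholds = np.linspace(0, 1, 101)   # finite H: h_t(x) = 1[x > t]
t_star = 0.5                          # realizability: true labels come from h in H
eps, delta = 0.1, 0.05
m = int(np.ceil(np.log(len(thresholds) / delta) / eps))  # Corollary 2.3 bound

failures, trials = 0, 1000
for _ in range(trials):
    x = rng.uniform(0, 1, size=m)
    y = (x > t_star).astype(int)
    # ERM: pick any threshold with zero (hence minimal) training error.
    t_erm = min((np.mean((x > t).astype(int) != y), t) for t in thresholds)[1]
    true_err = abs(t_erm - t_star)    # L_D(h_S) for uniform x on [0, 1]
    failures += true_err > eps

print(f"m = {m}, P[L_D(h_S) > eps] ~ {failures / trials:.3f}  (<= delta = {delta})")
```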
PAC learnable
It's one kind of learnability. Essentially it tells us whether it is possible to learn a hypothesis class $\mathcal{H}$ to any prescribed accuracy $\epsilon$ and confidence $1-\delta$ from finitely many samples. The function $m_{\mathcal{H}}(\epsilon,\delta)$, a lower bound on how many samples the guarantee requires, is called the sample complexity.
Definition
A hypothesis class $\mathcal{H}$ is PAC learnable if there exist a function $m_{\mathcal{H}}:(0,1)^2\to\mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in(0,1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every labeling function $f:\mathcal{X}\to\{0,1\}$, if the realizability assumption holds with respect to $\mathcal{H},\mathcal{D},f$, then when running the algorithm on $m \ge m_{\mathcal{H}}(\epsilon,\delta)$ i.i.d. examples generated by $\mathcal{D}$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with probability at least $1-\delta$ over the choice of the examples, $L_{(\mathcal{D},f)}(h)\le\epsilon$.
Finite $\mathcal{H}$ is PAC learnable by ERM, with sample complexity $m_{\mathcal{H}}(\epsilon,\delta) \le \lceil \log(|\mathcal{H}|/\delta)/\epsilon \rceil$ by Corollary 2.3.
Agnostic PAC learnable
But don't forget that all of the above needs the realizability assumption to hold, and realizability is too strong!
The Bayes optimal predictor: $f_{\mathcal{D}}(x) = \mathbb{1}\big[\mathbb{P}[y=1 \mid x] \ge 1/2\big]$.
Given any probability distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$, the Bayes optimal predictor is the best label predicting function: no other classifier has a lower error.
So the realizability assumption fails to hold if the distribution is noisy or otherwise strange, because then even the Bayes optimal predictor has nonzero error. If the assumption does not hold, it might be hopeless to achieve $L_{\mathcal{D}}(h)\le\epsilon$ in absolute terms; the best we can hope for is to compete with the best hypothesis in $\mathcal{H}$. The sketch below illustrates the Bayes predictor and its nonzero error floor.
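A minimal illustration of the Bayes optimal predictor on noisy data; the conditional probability `eta` is a hypothetical choice made for the example. Even the best possible predictor has error $\mathbb{E}[\min(\eta(x), 1-\eta(x))] > 0$, so no $h$ can reach arbitrarily small $L_{\mathcal{D}}(h)$.

```python
import numpy as np

def eta(x):
    # Hypothetical conditional probability P[y = 1 | x]: noisy near x = 0.5.
    return 1.0 / (1.0 + np.exp(-10 * (x - 0.5)))

def bayes_predictor(x):
    # Predict 1 iff P[y = 1 | x] >= 1/2.
    return (eta(x) >= 0.5).astype(int)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=100_000)
y = (rng.uniform(size=x.size) < eta(x)).astype(int)   # sample y ~ Bernoulli(eta(x))

bayes_error = np.mean(bayes_predictor(x) != y)
noise_floor = np.mean(np.minimum(eta(x), 1 - eta(x)))
print(f"empirical Bayes error {bayes_error:.3f} ~ noise floor {noise_floor:.3f} > 0")
```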
It's more general than PAC learnable: it introduces a loss intrinsic to the dataset itself, i.e., even the highest achievable score may not be a perfect score.
Definition of Agnostic PAC learnable:
A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if there exist a function $m_{\mathcal{H}}:(0,1)^2\to\mathbb{N}$ and a learning algorithm with the following property: for every $\epsilon,\delta\in(0,1)$ and every distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, when running the algorithm on $m\ge m_{\mathcal{H}}(\epsilon,\delta)$ i.i.d. examples generated by $\mathcal{D}$, the algorithm returns a hypothesis $h$ such that, with probability at least $1-\delta$, $L_{\mathcal{D}}(h) \le \min_{h'\in\mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$.
It is more general than PAC learnable: when the realizability assumption holds, $\min_{h'\in\mathcal{H}} L_{\mathcal{D}}(h') = 0$ and the definition reduces to PAC learnability.
VC dimension
Is an infinite hypothesis class always unlearnable? No: the class of threshold functions on the real line, for instance, is infinite yet learnable. What matters is not the cardinality of $\mathcal{H}$ but its VC dimension, built from the following notions.
Restriction of $\mathcal{H}$ to $C$
Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{0,1\}$ and let $C=\{c_1,\dots,c_m\}\subseteq\mathcal{X}$. The restriction of $\mathcal{H}$ to $C$ is the set of functions from $C$ to $\{0,1\}$ that can be derived from $\mathcal{H}$, that is, $\mathcal{H}_C = \{(h(c_1),\dots,h(c_m)) : h\in\mathcal{H}\}$,
where we represent each function from $C$ to $\{0,1\}$ as a vector in $\{0,1\}^{|C|}$.
Shattering
A hypothesis class $\mathcal{H}$ shatters a finite set $C\subseteq\mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$, that is, $|\mathcal{H}_C| = 2^{|C|}$.
Corollary 6.4 (like the no-free-lunch theorem, but stated via shattering). Let $\mathcal{H}$ be a hypothesis class of functions from $\mathcal{X}$ to $\{0,1\}$ and let $m$ be a training set size. Assume there exists a set $C\subseteq\mathcal{X}$ of size $2m$ that is shattered by $\mathcal{H}$. Then, for any learning algorithm $A$, there exist a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ and a predictor $h\in\mathcal{H}$ such that $L_{\mathcal{D}}(h)=0$, yet with probability at least $1/7$ over $S\sim\mathcal{D}^m$ we have $L_{\mathcal{D}}(A(S))\ge 1/8$.
Corollary 6.4 tells us that if $\mathcal{H}$ shatters some set $C$ of size $2m$, then we cannot learn $\mathcal{H}$ from only $m$ examples.
VC-dimension
The VC dimension of a hypothesis class $\mathcal{H}$, denoted $\mathrm{VCdim}(\mathcal{H})$, is the maximal size of a set $C\subseteq\mathcal{X}$ that can be shattered by $\mathcal{H}$. If $\mathcal{H}$ shatters sets of arbitrarily large size, we say $\mathcal{H}$ has infinite VC dimension. A brute-force shattering check is sketched below.
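A brute-force shattering check for the threshold class, my own sketch; the dense grid of thresholds is an approximation of the infinite class $\{h_t : t \in \mathbb{R}\}$. It confirms that a single point is shattered while a pair is not, so $\mathrm{VCdim} = 1$.

```python
import numpy as np

def restriction(C, thresholds):
    # H_C: the set of label vectors realizable on C by h_t(x) = 1[x > t].
    return {tuple((np.array(C) > t).astype(int)) for t in thresholds}

def is_shattered(C, thresholds):
    return len(restriction(C, thresholds)) == 2 ** len(C)

thresholds = np.linspace(-1, 2, 1000)          # dense grid approximating all t
print(is_shattered([0.5], thresholds))         # True: VCdim >= 1
print(is_shattered([0.3, 0.7], thresholds))    # False: labeling (1, 0) unrealizable
```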
Theorem 6.6. Let $\mathcal{H}$ be a class of infinite VC dimension. Then $\mathcal{H}$ is not PAC learnable.
Theorem 6.8 (the fundamental theorem of statistical learning, quantitative version).
The full statement is omitted here; it gives a coarse range for the sample complexity in terms of $d=\mathrm{VCdim}(\mathcal{H})$: up to constants, $m_{\mathcal{H}}(\epsilon,\delta)=\Theta\!\big(\frac{d+\log(1/\delta)}{\epsilon^2}\big)$ in the agnostic case, and of order $\frac{d+\log(1/\delta)}{\epsilon}$ (up to a $\log(1/\epsilon)$ factor) in the realizable case.
VC dimension is one way to measure sample complexity, but it also has a few drawbacks. One main problem is that it is mainly suited to classification (due to the shattering notion). What shall we do for regression? We may use Rademacher complexity.
Rademacher complexity
$\epsilon$-representative sample
A training set $S$ is called $\epsilon$-representative (w.r.t. domain $Z$, hypothesis class $\mathcal{H}$, loss function $\ell$, and distribution $\mathcal{D}$) if $\sup_{h\in\mathcal{H}} |L_{\mathcal{D}}(h) - L_S(h)| \le \epsilon$.
We define the representativeness of $S$ with respect to $\mathcal{F}=\ell\circ\mathcal{H}$ as $\mathrm{Rep}_{\mathcal{D}}(\mathcal{F},S) = \sup_{f\in\mathcal{F}} \big(L_{\mathcal{D}}(f) - L_S(f)\big)$.
Now, suppose we would like to estimate the representativeness of $S$ from the sample itself. One simple idea is to partition $S$ into two disjoint halves $S = S_1 \cup S_2$, treat $S_1$ as a validation set, and estimate the representativeness by $\sup_{f\in\mathcal{F}}\big(L_{S_1}(f) - L_{S_2}(f)\big)$.
To make this expression simpler, we introduce Rademacher random variables $\sigma_i$, uniform on $\{\pm 1\}$: with $\sigma_i=+1$ when $z_i\in S_1$ and $\sigma_i=-1$ when $z_i\in S_2$, the estimate above equals $\frac{2}{m}\sup_{f\in\mathcal{F}}\sum_{i=1}^m \sigma_i f(z_i)$, and a random $\sigma$ corresponds to a random split.
Rademacher complexity captures this idea by considering the expectation of the above with respect to a random choice of $\sigma$.
So the Rademacher complexity of $\mathcal{F}$ with respect to $S$ is $R(\mathcal{F}\circ S) = \frac{1}{m}\,\mathbb{E}_{\sigma\sim\{\pm 1\}^m}\Big[\sup_{f\in\mathcal{F}}\sum_{i=1}^m \sigma_i f(z_i)\Big]$.
We can see that the outcome measures how well a hypothesis class can correlate with random signs on $S$: richer classes fit random noise better and hence have higher Rademacher complexity. A Monte Carlo estimate is sketched below.
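A Monte Carlo estimate of this quantity for a small finite class, my own sketch: the class is represented directly by hypothetical value vectors $(f(z_1),\dots,f(z_m))$, one row per $f \in \mathcal{F}$, so the supremum becomes a max over rows.

```python
import numpy as np

rng = np.random.default_rng(3)

m = 50
# Each row is (f(z_1), ..., f(z_m)) for one f in F (hypothetical values).
F_on_S = rng.uniform(-1, 1, size=(20, m))

def empirical_rademacher(A, n_samples=10_000):
    # R(F o S) = (1/m) E_sigma [ max over rows of <sigma, row> ].
    sigma = rng.choice([-1.0, 1.0], size=(n_samples, A.shape[1]))
    return (sigma @ A.T).max(axis=1).mean() / A.shape[1]

print(f"R(F o S) ~ {empirical_rademacher(F_on_S):.4f}")
```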
Lemma 26.6. For any $A\subseteq\mathbb{R}^m$, scalar $c\in\mathbb{R}$, and vector $a_0\in\mathbb{R}^m$, we have $R(\{c\,a + a_0 : a\in A\}) \le |c|\,R(A)$. That is, when the loss function is scaled by a factor of $c$, the Rademacher complexity scales by the same factor, and a constant shift does not increase it.
Lemma 26.7. Let $A\subseteq\mathbb{R}^m$ and let $A' = \big\{\sum_{j=1}^N \alpha_j a^{(j)} : N\in\mathbb{N},\ a^{(j)}\in A,\ \alpha_j\ge 0,\ \|\alpha\|_1 = 1\big\}$ be the convex hull of $A$. Then $R(A') = R(A)$. (The key to the proof is that for each fixed $\sigma$, the supremum of the linear function $a\mapsto\langle\sigma,a\rangle$ over the convex hull is attained at an element of $A$.)
Massart Lemma
Let $A = \{a_1,\dots,a_N\}$ be a finite set of vectors in $\mathbb{R}^m$ and let $\bar{a} = \frac{1}{N}\sum_{i=1}^N a_i$. Then $R(A) \le \max_{a\in A}\|a-\bar{a}\|_2 \cdot \frac{\sqrt{2\log N}}{m}$.
Proof: The proof is somewhat like a Chernoff bound: first exponentiate and bound the maximum by a sum.
Without loss of generality assume $\bar{a}=0$ (by Lemma 26.6, a constant shift does not change $R(A)$). For every $\lambda > 0$, $m\,R(A) = \mathbb{E}_\sigma\big[\max_{a\in A}\langle\sigma,a\rangle\big] \le \frac{1}{\lambda}\log\sum_{a\in A}\mathbb{E}_\sigma\big[e^{\lambda\langle\sigma,a\rangle}\big] \le \frac{1}{\lambda}\Big(\log N + \frac{\lambda^2\max_{a}\|a\|_2^2}{2}\Big)$, using Jensen's inequality and $\mathbb{E}[e^{\lambda\sigma_i a_i}] = \cosh(\lambda a_i) \le e^{\lambda^2 a_i^2/2}$.
By setting $\lambda = \sqrt{2\log N}/\max_a\|a\|_2$, we obtain the stated bound; the numerical check below confirms it.
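A numerical sanity check of the Massart lemma under random data, my own sketch: we estimate $m\,R(A) = \mathbb{E}_\sigma\max_i\langle\sigma,a_i\rangle$ by Monte Carlo and compare it against the bound.

```python
import numpy as np

rng = np.random.default_rng(4)
m, N = 30, 8
A = rng.normal(size=(N, m))     # N hypothetical vectors a_1, ..., a_N in R^m
A -= A.mean(axis=0)             # WLOG center so that a-bar = 0 (Lemma 26.6)

sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
lhs = (sigma @ A.T).max(axis=1).mean()                   # ~ m * R(A)
rhs = np.linalg.norm(A, axis=1).max() * np.sqrt(2 * np.log(N))
print(f"m*R(A) ~ {lhs:.3f} <= Massart bound {rhs:.3f}")
```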
Contraction Lemma
For each $i\in[m]$, let $\phi_i:\mathbb{R}\to\mathbb{R}$ be a $\rho$-Lipschitz function, i.e., $|\phi_i(\alpha)-\phi_i(\beta)| \le \rho\,|\alpha-\beta|$ for all $\alpha,\beta\in\mathbb{R}$. For $a\in\mathbb{R}^m$ let $\phi(a) = (\phi_1(a_1),\dots,\phi_m(a_m))$ and $\phi\circ A = \{\phi(a) : a\in A\}$. Then $R(\phi\circ A) \le \rho\,R(A)$.
Proof: omitted.
This again concerns the loss function: when the loss is composed with a $\rho$-Lipschitz function, the Rademacher complexity grows by at most a factor of $\rho$.
Lemma 26.2. $\mathbb{E}_{S\sim\mathcal{D}^m}\big[\sup_{f\in\mathcal{F}}\big(L_{\mathcal{D}}(f)-L_S(f)\big)\big] \le 2\,\mathbb{E}_{S\sim\mathcal{D}^m}\big[R(\mathcal{F}\circ S)\big]$.
Proof: the key is how to bring in the Rademacher random variables. Draw a fresh 'ghost sample' $S'=(z'_1,\dots,z'_m)\sim\mathcal{D}^m$ independent of $S$, so that $L_{\mathcal{D}}(f) = \mathbb{E}_{S'}[L_{S'}(f)]$.
So: $\mathbb{E}_S\Big[\sup_f\big(L_{\mathcal{D}}(f)-L_S(f)\big)\Big] \le \mathbb{E}_{S,S'}\Big[\sup_f \frac{1}{m}\sum_{i=1}^m\big(f(z'_i)-f(z_i)\big)\Big] = \mathbb{E}_{S,S'}\Big[\sup_f \frac{1}{m}\sum_{i=1}^m \sigma_i\big(f(z'_i)-f(z_i)\big)\Big]$ for any fixed $\sigma\in\{\pm1\}^m$, hence also in expectation over a random $\sigma$.
We want to prove the last equality. Pick a specific index $j$ and set $\sigma_j=-1$: this swaps the roles of $z_j$ and $z'_j$, and since $z_j, z'_j$ are i.i.d., swapping them does not change the joint distribution, so the expectation is unchanged.
Then repeat for all indices $i\in[m]$. Finally, splitting the supremum gives $\mathbb{E}_{\sigma,S,S'}\big[\sup_f \frac{1}{m}\sum_i \sigma_i(f(z'_i)-f(z_i))\big] \le \mathbb{E}_{S'}\big[R(\mathcal{F}\circ S')\big] + \mathbb{E}_S\big[R(\mathcal{F}\circ S)\big] = 2\,\mathbb{E}_S\big[R(\mathcal{F}\circ S)\big]$, using that $-\sigma$ has the same distribution as $\sigma$.
Theorem 26.3 amounts to plugging the ERM output into the previous lemma: $\mathbb{E}_{S\sim\mathcal{D}^m}\big[L_{\mathcal{D}}(\mathrm{ERM}_{\mathcal{H}}(S))\big] - \min_{h\in\mathcal{H}} L_{\mathcal{D}}(h) \le 2\,\mathbb{E}_{S\sim\mathcal{D}^m}\big[R(\ell\circ\mathcal{H}\circ S)\big]$.
This again demonstrates the power of ERM: in expectation, ERM competes with the best hypothesis in the class whenever the Rademacher complexity is small.
Theorem 26.5 (two important formulas). Assume that for all $z$ and all $h\in\mathcal{H}$ we have $|\ell(h,z)|\le c$. Then:
- With probability at least $1-\delta$, for all $h\in\mathcal{H}$: $L_{\mathcal{D}}(h) - L_S(h) \le 2\,\mathbb{E}_{S'\sim\mathcal{D}^m}\big[R(\ell\circ\mathcal{H}\circ S')\big] + c\sqrt{\frac{2\ln(2/\delta)}{m}}$
- With probability at least $1-\delta$, for all $h\in\mathcal{H}$: $L_{\mathcal{D}}(h) - L_S(h) \le 2\,R(\ell\circ\mathcal{H}\circ S) + 4c\sqrt{\frac{2\ln(4/\delta)}{m}}$
The second bound is data-dependent: everything on its right-hand side can be computed from the sample, as in the helper below.
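A tiny helper that just plugs numbers into the second formula; the inputs (`rad_S`, `c`, `m`, `delta`) are hypothetical, meant to show how the data-dependent bound is evaluated in practice.

```python
import numpy as np

def rademacher_gen_bound(rad_S, c=1.0, m=10_000, delta=0.05):
    # Second bound of Theorem 26.5: 2 R(l o H o S) + 4c sqrt(2 ln(4/delta) / m),
    # holding uniformly over h in H with probability >= 1 - delta.
    return 2 * rad_S + 4 * c * np.sqrt(2 * np.log(4 / delta) / m)

# Hypothetical empirical Rademacher complexity of 0.02:
print(f"L_D(h) - L_S(h) <= {rademacher_gen_bound(rad_S=0.02):.4f} w.p. 0.95")
```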
Rademacher complexity of linear classes as an example
Let $S=(x_1,\dots,x_m)$ be vectors in a Hilbert space and define $\mathcal{H}\circ S = \{(\langle w,x_1\rangle,\dots,\langle w,x_m\rangle) : \|w\|_2 \le 1\}$. Then (Lemma 26.10) $R(\mathcal{H}\circ S) \le \frac{\max_i\|x_i\|_2}{\sqrt{m}}$. The proof uses the closed form $\sup_{\|w\|_2\le 1}\sum_i \sigma_i\langle w,x_i\rangle = \big\|\sum_i \sigma_i x_i\big\|_2$ followed by Jensen's inequality; a quick numerical check follows.
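A Monte Carlo check of this lemma on a random Gaussian sample, my own sketch; it uses the closed form above, so no optimization over $w$ is needed.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 40, 5
X = rng.normal(size=(m, d))     # hypothetical sample S = (x_1, ..., x_m)

sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
# sup over the unit ball has closed form ||sum_i sigma_i x_i||_2:
rad = np.linalg.norm(sigma @ X, axis=1).mean() / m       # ~ R(H o S)
bound = np.linalg.norm(X, axis=1).max() / np.sqrt(m)     # Lemma 26.10
print(f"R(H o S) ~ {rad:.4f} <= {bound:.4f}")
```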