Chapter 3: Generalization Theory

Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".

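What problem does generalization theory try to solve?

Given that the training loss is $L_{\text{train}} = \epsilon$, what can we say about the population loss $L_D$?

  1. The traditional way is to sample from $D$ again to get a test set and then measure $L_{\text{test}}$.
  2. We can use theory to derive generalization bounds.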

The latter is what generalization theory wants to do.

No free lunch theorem

It says that if we don't have any prior knowledge of the data distribution, we can't design an algorithm that always works well. In other words, there is no universal learner. By this theorem, even the human brain only works in certain scenarios. However, the theorem considers a rather artificial setting (learning labels with no prior knowledge at all), which may not arise in practice.

Theorem 5.1 (no-free-lunch) Let $A$ be any learning algorithm for the task of binary classification with respect to the 0-1 loss over a domain $X$. Let $m$ be any number smaller than $|X|/2$, representing a training set size. Then, there exists a distribution $D$ over $X \times \{0,1\}$ such that:

  1. There exists a function $f: X \to \{0,1\}$ with $L_D(f) = 0$.
  2. With probability at least $1/7$ over the choice of $S \sim D^m$ (that is, $S$ is a training set of size $m$ drawn iid from $D$), we have $L_D(A(S)) \ge 1/8$.

That is,

$$P_{S \sim D^m}\left(L_D(A(S)) \ge \frac{1}{8}\right) \ge \frac{1}{7}$$

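First, let us be clear about what these symbols mean.

$X$ is the set of raw instances. $X \times \{0,1\}$ is the space of labeled examples for binary classification, the Cartesian product of the instance set and the label set. $D$ is a probability distribution over $X \times \{0,1\}$ from which labeled examples are sampled; in other words, it describes a labeled dataset. Only some distributions are admissible, as controlled by condition 1: there must be at least one $f$ that achieves a perfect score on the distribution.

Before the proof, consider this condition. Suppose the whole domain $X$ has size $2m$. Then there are $T = 2^{2m}$ functions (with no restrictions) that map $X$ to the binary label set $\{0,1\}$. Denote these functions by $\{f_1, \dots, f_T\}$ and the distributions they define by $\{D_1, \dots, D_T\}$:

$$D_i(\{(x_j, f_i(x_j))\}) = \frac{1}{2m}, \quad \forall x_j \in X$$

That is, $D_i$ consists of the $2m$ examples labeled by $f_i$, each with equal probability.

All $T$ of these distributions satisfy condition (1): for each $D_i$ there is a function that is correct everywhere, namely $f_i$.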

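How should we approach the rest of the proof? Given the probabilistic form we just wrote down, it is natural to think of Markov's inequality (for $a > 0$):

$$\Pr(x > a) \le \frac{E|x|}{a}$$

Add to this a fairly direct intuition: I have seen half of the questions and can answer them, and on the other half I can only guess. For binary classification a guess is right half the time, so the accuracy is $3/4$.

$$\begin{aligned}
E_{S \sim D^m}[1 - L_D(A(S))] \le \frac{3}{4}
&\implies \Pr\left(1 - L_D(A(S)) > \frac{7}{8}\right) \le \frac{E_{S \sim D^m}[1 - L_D(A(S))]}{7/8} \le \frac{6}{7} \\
&\implies \Pr\left(L_D(A(S)) < \frac{1}{8}\right) \le \frac{6}{7} \\
&\implies \Pr\left(L_D(A(S)) \ge \frac{1}{8}\right) \ge \frac{1}{7}
\end{aligned}$$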
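The arithmetic in the Markov step can be checked with exact fractions. This is only a small sketch: the function name `markov_upper_bound` and the variable names are illustrative, and the value $3/4$ is the assumed upper bound on the expected accuracy that the rest of the proof establishes.

```python
from fractions import Fraction

def markov_upper_bound(expectation, threshold):
    """Markov's inequality for a nonnegative variable: Pr(X > threshold) <= E[X] / threshold."""
    return expectation / threshold

# Assumed upper bound on the expected accuracy E[1 - L_D(A(S))].
expected_accuracy = Fraction(3, 4)
# Accuracy above 7/8 is the same event as loss below 1/8.
accuracy_threshold = Fraction(7, 8)

p_small_loss = markov_upper_bound(expected_accuracy, accuracy_threshold)
p_large_loss = 1 - p_small_loss  # lower bound on Pr(L_D(A(S)) >= 1/8)

print(p_small_loss, p_large_loss)  # 6/7 1/7
```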

The key, then, is to prove $E_{S \sim D^m}[1 - L_D(A(S))] \le \frac{3}{4}$ for at least one of the distributions $D_i$:

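For any $D_i$, we can draw $k = (2m)^m$ different training sequences of $m$ instances, denoted $\{S_1, \dots, S_k\}$. To indicate that a sequence is drawn from $D_i$, that is, labeled by $f_i$, we add a superscript $i$ and write $\{S_1^i, \dots, S_k^i\}$.

$$\max_{i \in [T]} E_{S \sim D_i^m}[L_{D_i}(A(S))] = \max_{i \in [T]} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i))$$

(using max $\ge$ average $\ge$ min).

Originally we averaged over different training sets; now we average over different labelings. This makes things simpler: with the training instances fixed, we can easily define the elements that do not appear in the training set. Let $\{v_1, \dots, v_p\}$ be the elements of $X$ that do not appear in the training set. Because the training set may contain the same element several times, it has at most $m$ distinct elements, so $p \ge m$: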

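For any hypothesis $h$, with $C = X$ the domain of size $2m$:

$$L_{D_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbb{1}[h(x) \ne f_i(x)] \ge \frac{1}{2m} \sum_{r=1}^{p} \mathbb{1}[h(v_r) \ne f_i(v_r)] \ge \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}[h(v_r) \ne f_i(v_r)]$$

So, averaging over all $D_i$, we get:

$$\frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \ge \frac{1}{T} \sum_{i=1}^{T} \frac{1}{2p} \sum_{r=1}^{p} \mathbb{1}[A(S_j^i)(v_r) \ne f_i(v_r)] = \frac{1}{2} \cdot \frac{1}{p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}[A(S_j^i)(v_r) \ne f_i(v_r)] \ge \frac{1}{2} \min_{r \in [p]} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}[A(S_j^i)(v_r) \ne f_i(v_r)]$$

Whatever the learned prediction $A(S_j^i)(v_r)$ is, ranging over all possible $f_i(v_r)$ it is wrong half the time and right half the time. More precisely, pair each $f_i$ with the $f_{i'}$ that differs from it only at $v_r$: since $v_r$ is not in the training set, $S_j^i = S_j^{i'}$, so the learner outputs the same hypothesis for both, and that hypothesis errs at $v_r$ for exactly one of the two. So for every $r$ we have:

$$\frac{1}{T} \sum_{i=1}^{T} \mathbb{1}[A(S_j^i)(v_r) \ne f_i(v_r)] = \frac{1}{2}$$

We do not even need to take the min to obtain:

$$\max_{i \in [T]} E_{S \sim D_i^m}[L_{D_i}(A(S))] \ge \min_{j \in [k]} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \ge \min_{j \in [k]} \frac{1}{2} \cdot \frac{1}{p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbb{1}[A(S_j^i)(v_r) \ne f_i(v_r)] = \frac{1}{4}$$

Hence some $D_i$ has expected loss at least $1/4$, that is, $E_{S \sim D_i^m}[1 - L_{D_i}(A(S))] \le 3/4$.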
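The averaging argument can be checked by brute force on a tiny domain. The sketch below is an illustration under stated assumptions, not part of the original proof: it takes $|X| = 2m$ with $m = 2$, and uses a hypothetical learner (`memorizing_learner`) that memorizes the labels it saw and predicts 0 elsewhere. Any other deterministic learner could be plugged in.

```python
from fractions import Fraction
from itertools import product

def memorizing_learner(sample, default=0):
    """Hypothetical learner: memorize seen labels, predict `default` on unseen points."""
    seen = dict(sample)
    return lambda x: seen.get(x, default)

def nfl_losses(m, learner=memorizing_learner):
    """Return (average over labelings, worst labeling) of the expected loss E_S[L_{D_i}(A(S))]."""
    domain = range(2 * m)
    labelings = list(product((0, 1), repeat=2 * m))   # T = 2^(2m) functions f_i
    sequences = list(product(domain, repeat=m))       # k = (2m)^m training sequences
    per_labeling = []
    for f in labelings:
        total = Fraction(0)
        for xs in sequences:
            h = learner([(x, f[x]) for x in xs])
            mistakes = sum(h(x) != f[x] for x in domain)
            total += Fraction(mistakes, 2 * m)        # L_{D_i}(h) under the uniform D_i
        per_labeling.append(total / len(sequences))
    return sum(per_labeling) / len(per_labeling), max(per_labeling)

avg_loss, worst_loss = nfl_losses(2)
print(avg_loss, worst_loss >= avg_loss >= Fraction(1, 4))  # 9/32 True
```

The average over labelings is at least $1/4$, so the worst labeling is too, which is the max $\ge$ average step of the proof.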

Finite hypothesis class and ERM

The hypothesis class $H$, a class of functions, can be finite or infinite.

$\mathrm{ERM}_H$: pick the hypothesis with the smallest training loss:

$$\mathrm{ERM}_H(S) \in \arg\min_{h \in H} L_S(h)$$

If $H$ is an infinite class, $\mathrm{ERM}_H$ may not give the best solution because of overfitting (memorization).
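As a minimal sketch of ERM over a finite class, the code below uses a hypothetical class of five threshold classifiers on the real line and a made-up four-point sample. The names `empirical_loss`, `erm`, and `make_threshold` are illustrative only.

```python
def empirical_loss(h, sample):
    """0-1 training loss L_S(h): fraction of examples that h mislabels."""
    return sum(h(x) != y for x, y in sample) / len(sample)

def erm(hypotheses, sample):
    """Return a hypothesis with the smallest training loss (ties: first one found)."""
    return min(hypotheses, key=lambda h: empirical_loss(h, sample))

def make_threshold(t):
    """Threshold classifier: predict 1 iff x >= t."""
    return lambda x: int(x >= t)

# Hypothetical finite hypothesis class and toy sample.
hypotheses = [make_threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
sample = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]

h_s = erm(hypotheses, sample)
print(empirical_loss(h_s, sample))  # 0.0
```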

Realizability assumption

To show that the ERM algorithm is powerful, we introduce the realizability assumption:

There exists $h^* \in H$ such that $L_{D,f}(h^*) = 0$.

Note that this assumption implies that with probability 1 over random samples $S$, where the instances of $S$ are sampled according to $D$ and labeled by $f$, we have $L_S(h^*) = 0$.

Later we will see that it is not always possible to pick a function that perfectly solves the problem.

Corollary 2.3

It says that if the problem is realizable, then the ERM hypothesis is good enough: once the sample size reaches the threshold below, the function learned by ERM is good enough.

Let $H$ be a finite hypothesis class. Let $\delta \in (0,1)$ and $\epsilon > 0$, and let $m$ be an integer that satisfies $m \ge \log(|H|/\delta)/\epsilon$. Then, for any labeling function $f$ and for any distribution $D$ for which the realizability assumption holds (that is, for some $h \in H$, $L_{D,f}(h) = 0$), with probability at least $1 - \delta$ over the choice of an iid sample $S$ of size $m$, we have that for every ERM hypothesis $h_S$ it holds that:

$$L_{D,f}(h_S) \le \epsilon$$

Proof:

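We would like to bound the probability of the event $L_{D,f}(h_S) > \epsilon$.

Let $H_B$ be the set of bad hypotheses, that is:

$$H_B = \{h \in H : L_{D,f}(h) > \epsilon\}$$

Since the realizability assumption implies that $L_S(h_S) = 0$, a bad hypothesis picked by ERM must also achieve a perfect score on $S$.

Let $M$ be the set of training sets on which some bad hypothesis looks perfect:

$$M = \{S : \exists h \in H_B, L_S(h) = 0\}$$

Only when $S$ falls in $M$ can ERM possibly output such a bad hypothesis:

$$\begin{aligned}
D^m(\{S : L_{D,f}(h_S) > \epsilon\}) &\le D^m(M) \\
&= D^m\left(\bigcup_{h \in H_B} \{S : L_S(h) = 0\}\right) && \text{(definition)} \\
&\le \sum_{h \in H_B} D^m(\{S : L_S(h) = 0\}) && \text{(union bound)}
\end{aligned}$$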

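The first inequality holds because if the $h_S$ trained on $S$ is bad, then $S$ must at least be a training set on which some bad hypothesis has zero training loss. This is only a necessary condition, hence the $\le$.

Now fix an $h \in H_B$ and consider which elements $S$ may contain. Under $D$, $h$ predicts wrongly on a set of probability mass greater than $\epsilon$, so those elements cannot be chosen. Each element must come from the remaining mass, which is at most $1 - \epsilon$:

$$D(\{x_i : h(x_i) = y_i\}) = 1 - L_{D,f}(h) \le 1 - \epsilon$$

$$\begin{aligned}
D^m(\{S : L_S(h) = 0\}) &= D^m(\{S : \forall x_i \in S, h(x_i) = y_i\}) && \text{(only examples that $h$ gets right are drawn)} \\
&= \prod_{i=1}^{m} D(\{x_i : h(x_i) = y_i\}) && \text{(independent events)} \\
&\le (1 - \epsilon)^m \le e^{-\epsilon m}
\end{aligned}$$

Finally:

$$\sum_{h \in H_B} D^m(\{S : L_S(h) = 0\}) \le |H_B| e^{-\epsilon m} \le |H| e^{-\epsilon m}$$

(this is why $H$ needs to be finite).

To make $|H| e^{-\epsilon m} \le \delta$, we solve for $m$ and get:

$$m \ge \frac{\log(|H|/\delta)}{\epsilon}$$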
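To get a feel for the bound, here is a small sketch that evaluates it. The function name `sample_complexity` and the numbers ($|H| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$) are made up for illustration; $\log$ is taken to be the natural logarithm, matching the $e^{-\epsilon m}$ step in the proof.

```python
import math

def sample_complexity(num_hypotheses, epsilon, delta):
    """Smallest integer m with m >= log(|H| / delta) / epsilon (natural log)."""
    return math.ceil(math.log(num_hypotheses / delta) / epsilon)

# Illustrative values, not taken from the notes.
num_hypotheses, epsilon, delta = 1000, 0.1, 0.05

m = sample_complexity(num_hypotheses, epsilon, delta)
failure_bound = num_hypotheses * math.exp(-epsilon * m)  # |H| * e^(-eps * m)

print(m, failure_bound <= delta)  # 100 True
```

Note that $m$ grows only logarithmically in $|H|$ and $1/\delta$, but linearly in $1/\epsilon$.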

PAC learnable

It is one kind of learnability. Essentially, it tells us whether it is possible to learn a hypothesis class $H$.

The function $m_H$ is a lower bound on the number of samples you need: with at least this many samples, some algorithm trained on them returns an outcome that is good on the population distribution.

Definition

A hypothesis class $H$ is PAC learnable if there exist a function $m_H(\epsilon, \delta)$ and a learning algorithm (not necessarily ERM) with the following property:

For every $\epsilon, \delta \in (0,1)$, for every distribution $D$ over $X$, and for every labeling function $f: X \to \{0,1\}$, if the realizability assumption holds with respect to $H, D, f$, then when running the learning algorithm on $m \ge m_H(\epsilon, \delta)$ iid examples generated by $D$ and labeled by $f$, the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$ (over the choice of the examples):

$$L_{D,f}(h) \le \epsilon$$

A finite $H$ is PAC learnable by ERM.

Agnostic PAC learnable

But don't forget that this requires the realizability assumption to hold, and realizability is too strong!

The Bayes optimal predictor:

$$f(x) = \begin{cases} 1 & \text{if } P(y = 1 \mid x) \ge \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$$

Given any probability distribution, the Bayes optimal predictor is the best label-predicting function: no other classifier has a lower error.
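The optimality claim can be verified by enumeration on a small discrete distribution. Everything in the sketch below is made up for illustration: three instances `a`, `b`, `c` with a marginal over $x$ and a posterior $P(y = 1 \mid x)$, and the helper names `bayes_predictor` and `population_error`.

```python
from fractions import Fraction
from itertools import product

def bayes_predictor(posterior):
    """Predict 1 exactly where P(y=1|x) >= 1/2."""
    return {x: int(p >= Fraction(1, 2)) for x, p in posterior.items()}

def population_error(predictor, marginal, posterior):
    """0-1 loss L_D of a predictor under P(x) = marginal and P(y=1|x) = posterior."""
    error = Fraction(0)
    for x, px in marginal.items():
        p_wrong = 1 - posterior[x] if predictor[x] == 1 else posterior[x]
        error += px * p_wrong
    return error

# Hypothetical noisy distribution: no labeling function reaches zero error here.
marginal = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
posterior = {"a": Fraction(9, 10), "b": Fraction(1, 2), "c": Fraction(1, 5)}

bayes_error = population_error(bayes_predictor(posterior), marginal, posterior)

# Compare against every one of the 2^3 possible classifiers on this domain.
xs = list(marginal)
all_errors = [
    population_error(dict(zip(xs, labels)), marginal, posterior)
    for labels in product((0, 1), repeat=len(xs))
]
print(bayes_error, min(all_errors))  # 9/40 9/40
```

Because the Bayes error is strictly positive in this example, the realizability assumption cannot hold for any hypothesis class.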

So the realizability assumption fails to hold if the distribution is noisy or otherwise irregular. If it does not hold, it might be hopeless to have $L_{D,f}(h) \le \epsilon$. But that is fine: we can use an adjusted version of PAC learning.

It is more general than PAC learnable. It introduces a loss inherent to the data distribution itself: even the best achievable score falls short of a perfect score.

Definition of Agnostic PAC learnable:

A hypothesis class $H$ is agnostic PAC learnable if there exist a function $m_H(\epsilon, \delta)$ and a learning algorithm (not necessarily ERM) with the following property:

For every $\epsilon, \delta \in (0,1)$ and for every distribution $D$ over $X \times \{0,1\}$, when running the learning algorithm on $m \ge m_H(\epsilon, \delta)$ iid examples generated by $D$, the algorithm returns a hypothesis $h$ such that, with probability at least $1 - \delta$ (over the choice of the examples):

$$L_D(h) \le \min_{h' \in H} L_D(h') + \epsilon$$

It is more general than PAC learnable.

VC dimension

Is an infinite hypothesis class $H$ learnable? The answer can be yes. But how can we distinguish a learnable infinite $H$ from an unlearnable one? We need to introduce the VC dimension to help us.

Restriction of H to C

Let $H$ be a class of functions from $X$ to $\{0,1\}$, and let $C = \{c_1, \dots, c_m\} \subset X$. The restriction of $H$ to $C$ is the set of functions from $C$ to $\{0,1\}$ that can be derived from $H$. That is,

$$H_C = \{(h(c_1), \dots, h(c_m)) : h \in H\}$$

where we represent each function from $C$ to $\{0,1\}$ as a vector in $\{0,1\}^{|C|}$.

Shattering
A hypothesis class $H$ shatters a finite set $C \subset X$ if the restriction of $H$ to $C$ is the set of all functions from $C$ to $\{0,1\}$. That is, $|H_C| = 2^{|C|}$.

Corollary 6.4: like the no-free-lunch theorem, but about shattering

Corollary 6.4 tells us that if $H$ shatters some set $C$ of size $2m$, then we cannot learn $H$ using $m$ examples. Intuitively, if a set $C$ is shattered by $H$ and the sample contains only half of the instances of $C$, the labels of these instances give us no information about the labels of the rest of the instances in $C$.

VC-dimension

The VC dimension of a hypothesis class H is the maximal size of a set C that can be shattered by H. If H can shatter sets of arbitrarily large size we say that H has infinite VC-dimension.
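On a finite domain, restriction, shattering, and the VC dimension can all be computed by brute force. The sketch below is illustrative only: the domain $\{0, \dots, 4\}$ and the two classes (thresholds and intervals over integer endpoints) are assumptions chosen because their VC dimensions are easy to verify by hand.

```python
from itertools import combinations

def restriction(hypotheses, points):
    """H_C: the set of label vectors (h(c_1), ..., h(c_m)) realized by H on C."""
    return {tuple(h(c) for c in points) for h in hypotheses}

def shatters(hypotheses, points):
    """H shatters C iff |H_C| = 2^|C|."""
    return len(restriction(hypotheses, points)) == 2 ** len(points)

def vc_dimension(hypotheses, domain):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(shatters(hypotheses, c) for c in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop here
    return best

def threshold(t):
    return lambda x: int(x >= t)

def interval(a, b):
    return lambda x: int(a <= x <= b)  # empty when a > b

domain = list(range(5))
thresholds = [threshold(t) for t in range(6)]
intervals = [interval(a, b) for a in range(5) for b in range(5)]

print(vc_dimension(thresholds, domain), vc_dimension(intervals, domain))  # 1 2
```

Thresholds cannot realize the labeling (1, 0) on an increasing pair of points, and intervals cannot realize (1, 0, 1) on a triple, which is why the search stops at 1 and 2.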

Theorem 6.6. Let H be a class of infinite VC-dimension. Then, H is not PAC learnable.

Theorem 6.8 (the fundamental theorem of statistical learning – quantitative version).

Omitted here; it gives a coarse range for $m_H(\epsilon, \delta)$.

The VC dimension is one way to measure sample complexity, but it also has a few drawbacks.

One main problem with the VC dimension is that it is mainly used for classification (due to the notion of shattering).

What should we do for regression? We may use Rademacher complexity.

Rademacher complexity

ϵ-representative sample

A training set $S$ is called $\epsilon$-representative if

$$\sup_{h \in H} |L_D(h) - L_S(h)| \le \epsilon$$

We define the representativeness of $S$ with respect to $F$ as the largest gap between the true error of a function $f$ and its empirical error:

$$\mathrm{Rep}_D(F, S) = \sup_{f \in F} (L_D(f) - L_S(f))$$

Now, suppose we would like to estimate the representativeness of $S$ using the sample $S$ only. One simple idea is to split $S$ into two disjoint sets, $S = S_1 \cup S_2$, and to refer to $S_1$ as a validation set and to $S_2$ as a training set. We can then estimate the representativeness of $S$ by:

$$\sup_{f \in F} (L_{S_1}(f) - L_{S_2}(f))$$

To simplify this expression, we introduce Rademacher random variables. Let $(\sigma_1, \sigma_2, \dots, \sigma_m) \in \{-1, +1\}^m$ be a vector such that $S_1 = \{z_i = (x_i, y_i) : \sigma_i = 1\}$ and $S_2 = \{z_i = (x_i, y_i) : \sigma_i = -1\}$. If we assume $E[\sigma_i] = 0$, that is, $|S_1| = |S_2|$, then:

$$\sup_{f \in F} (L_{S_1}(f) - L_{S_2}(f)) = \frac{2}{m} \sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i)$$
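The identity between the split form and the signed-sum form can be checked directly. In this sketch the two loss vectors (one per function $f$, holding the values $f(z_i)$) and the sign vector are made-up numbers, and the split is balanced so that $|S_1| = |S_2| = m/2$.

```python
from fractions import Fraction

def split_gap(values, signs):
    """L_{S1}(f) - L_{S2}(f), where S1 holds the +1 indices and S2 the -1 indices."""
    s1 = [v for v, s in zip(values, signs) if s == 1]
    s2 = [v for v, s in zip(values, signs) if s == -1]
    return Fraction(sum(s1), len(s1)) - Fraction(sum(s2), len(s2))

def signed_sum(values, signs):
    """(2/m) * sum_i sigma_i * f(z_i)."""
    m = len(values)
    return Fraction(2, m) * sum(s * v for s, v in zip(signs, values))

# Hypothetical values f(z_1..z_4) for two functions f, and a balanced sign vector.
losses_per_function = [[1, 0, 1, 1], [0, 0, 1, 0]]
signs = [1, -1, -1, 1]

lhs = max(split_gap(v, signs) for v in losses_per_function)
rhs = max(signed_sum(v, signs) for v in losses_per_function)
print(lhs, rhs)  # 1/2 1/2
```

The identity holds function by function, so it also holds after taking the supremum over $F$.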

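Rademacher complexity captures this idea by considering the expectation of the above with respect to a random choice of $\sigma$.
The Rademacher complexity of $F$ with respect to $S$ is defined as follows:

$$R(F \circ S) = \frac{1}{m} E_{\sigma \sim \{\pm 1\}^m}\left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i)\right]$$

We can view the outcomes of a hypothesis $f$ on the whole dataset as a vector $a$, with each coordinate $a_i$ corresponding to $f(z_i)$. Writing $A$ for the set of such vectors, the definition becomes:

$$R(A) = \frac{1}{m} E_{\sigma}\left[\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i\right] = \frac{1}{m} E_{\sigma}\left[\sup_{a \in A} \langle \sigma, a \rangle\right]$$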
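For a finite set $A$ and small $m$, the expectation over $\sigma$ can be computed exactly by enumerating all $2^m$ sign vectors. This is a sketch for intuition only; the function name `rademacher_complexity` and the example sets are made up.

```python
from fractions import Fraction
from itertools import product

def rademacher_complexity(vectors):
    """Exact R(A) = (1/m) E_sigma[max_{a in A} <sigma, a>] for a finite set A in R^m."""
    m = len(vectors[0])
    total = Fraction(0)
    for sigma in product((-1, 1), repeat=m):          # all 2^m sign vectors
        total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in vectors)
    return total / (2 ** m * m)

# A single vector cannot correlate with random signs on average.
print(rademacher_complexity([(3, -2)]))               # 0
# Two opposite vectors can always match the sign of sigma_1 + sigma_2.
print(rademacher_complexity([(1, 1), (-1, -1)]))      # 1/2
```

A richer set $A$ can fit more sign patterns, so its Rademacher complexity is larger.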

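Lemma 26.6 For any $A \subset \mathbb{R}^m$, scalar $c \in \mathbb{R}$, and vector $a_0 \in \mathbb{R}^m$, we have:

$$R(\{ca + a_0 : a \in A\}) = |c| R(A)$$

That is, when the loss function is scaled by a factor of $c$, the Rademacher complexity is scaled by $|c|$ as well, and shifting by $a_0$ does not change it.

Lemma 26.7 Let $A$ be a subset of $\mathbb{R}^m$ and let $C = \{c \in \mathbb{R}^N : c_j \ge 0, \|c\|_1 = 1\}$. Let $A' = \{\sum_{j=1}^{N} c_j a^{(j)} : c \in C, a^{(j)} \in A\}$. Then $R(A') = R(A)$.

(The key to the proof is $\max_{c \in C} \langle c, v \rangle = \max_j v_j$.)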
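Both lemmas can be sanity-checked numerically on a tiny example. The sketch below recomputes the Rademacher complexity exactly by enumeration. The set $A$, the scalar $c = -3$, the shift $a_0 = (5, 7)$, and the two convex combinations are all made-up values, and the convex-hull check only adds a few points of $A'$ rather than the whole hull.

```python
from fractions import Fraction
from itertools import product

def rademacher_complexity(vectors):
    """Exact R(A) for a finite set A in R^m, enumerating all sign vectors."""
    m = len(vectors[0])
    total = Fraction(0)
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in vectors)
    return total / (2 ** m * m)

A = [(1, 1), (-1, -1)]
base = rademacher_complexity(A)

# Lemma 26.6: scaling by c and shifting by a0 multiplies R by |c|.
c, a0 = -3, (5, 7)
scaled_shifted = [tuple(c * x + s for x, s in zip(vec, a0)) for vec in A]
scaled = rademacher_complexity(scaled_shifted)

# Lemma 26.7: adding convex combinations of elements of A does not change R.
half = Fraction(1, 2)
with_mixtures = A + [(0, 0), (half, half)]  # 1/2*a1 + 1/2*a2 and 3/4*a1 + 1/4*a2
hull = rademacher_complexity(with_mixtures)

print(base, scaled, hull)  # 1/2 3/2 1/2
```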

Massart Lemma

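Let $A = \{a_1, \dots, a_N\}$ be a finite set of vectors in $\mathbb{R}^m$. Define $\bar{a} = \frac{1}{N} \sum_{i=1}^{N} a_i$. Then:

$$R(A) \le \max_{a \in A} \|a - \bar{a}\|_2 \frac{\sqrt{2 \log(N)}}{m}$$

Proof: The proof is somewhat like that of the Chernoff bound.

Without loss of generality $\bar{a} = 0$ (by Lemma 26.6, shifting $A$ does not change $R(A)$). Let $\lambda > 0$ and let $A' = \{\lambda a_1, \dots, \lambda a_N\}$. We upper bound the Rademacher complexity as follows:

$$\begin{aligned}
m R(A') &= E_{\sigma}\left[\max_{a \in A'} \langle \sigma, a \rangle\right] = E_{\sigma}\left[\log\left(e^{\max_{a \in A'} \langle \sigma, a \rangle}\right)\right] = E_{\sigma}\left[\log\left(\max_{a \in A'} e^{\langle \sigma, a \rangle}\right)\right] \\
&\le E_{\sigma}\left[\log\left(\sum_{a \in A'} e^{\langle \sigma, a \rangle}\right)\right] \\
&\le \log\left(E_{\sigma}\left[\sum_{a \in A'} e^{\langle \sigma, a \rangle}\right]\right) && \text{(Jensen's inequality)} \\
&= \log\left(E_{\sigma}\left[\sum_{a \in A'} \prod_{i=1}^{m} e^{\sigma_i a_i}\right]\right) = \log\left(\sum_{a \in A'} \prod_{i=1}^{m} E_{\sigma_i}\left[e^{\sigma_i a_i}\right]\right)
\end{aligned}$$

Since each $\sigma_i$ is uniform on $\{-1, +1\}$:

$$E_{\sigma_i}\left[e^{\sigma_i a_i}\right] = \frac{e^{a_i} + e^{-a_i}}{2} \le e^{a_i^2/2}$$

Therefore:

$$m R(A') \le \log\left(\sum_{a \in A'} \prod_{i=1}^{m} e^{a_i^2/2}\right) = \log\left(\sum_{a \in A'} e^{\|a\|_2^2/2}\right) \le \log\left(|A'| \max_{a \in A'} e^{\|a\|_2^2/2}\right) = \log(|A'|) + \max_{a \in A'} \frac{\|a\|_2^2}{2}$$

Since $R(A) = \frac{1}{\lambda} R(A')$ and $|A'| = |A|$:

$$R(A) \le \frac{\log(|A|) + \lambda^2 \max_{a \in A} \|a\|_2^2 / 2}{\lambda m}$$

Setting $\lambda = \sqrt{2 \log(|A|) / \max_{a \in A} \|a\|_2^2}$ gives the bound stated in the lemma.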
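A quick numerical comparison of the exact Rademacher complexity with the Massart bound, as a sketch. The set used here, all $N = 8$ binary vectors of length $m = 3$, is an assumption chosen because its exact complexity is easy to work out by hand (it equals $1/2$).

```python
import math
from itertools import product

def rademacher_complexity(vectors):
    """Exact R(A) for a finite set A in R^m, enumerating all sign vectors."""
    m = len(vectors[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in vectors)
    return total / (2 ** m * m)

def massart_bound(vectors):
    """max_a ||a - mean||_2 * sqrt(2 log N) / m."""
    n, m = len(vectors), len(vectors[0])
    mean = [sum(vec[i] for vec in vectors) / n for i in range(m)]
    radius = max(
        math.sqrt(sum((vec[i] - mean[i]) ** 2 for i in range(m))) for vec in vectors
    )
    return radius * math.sqrt(2 * math.log(n)) / m

A = list(product((0, 1), repeat=3))   # hypothetical example: the binary cube {0,1}^3
exact = rademacher_complexity(A)
bound = massart_bound(A)
print(exact, exact <= bound)  # 0.5 True
```

In this example the bound is fairly tight; the gap grows for sets whose elements are far from spread out evenly around their mean.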

Contraction Lemma

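For each $i \in [m]$, let $\phi_i : \mathbb{R} \to \mathbb{R}$ be a $\rho$-Lipschitz function, that is, $|\phi_i(\alpha) - \phi_i(\beta)| \le \rho |\alpha - \beta|$ for all $\alpha, \beta$. For $a \in \mathbb{R}^m$, let $\phi(a)$ denote $(\phi_1(a_1), \dots, \phi_m(a_m))$. Let $\phi \circ A = \{\phi(a) : a \in A\}$. Then $R(\phi \circ A) \le \rho R(A)$.

Proof: omitted.

This is also about the loss function: when the loss is further composed with a $\rho$-Lipschitz function, the Rademacher complexity grows by at most a factor of $\rho$.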
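The contraction lemma can be sanity-checked on a small example, assuming each $\phi_i$ is $\rho$-Lipschitz (that is, $|\phi_i(\alpha) - \phi_i(\beta)| \le \rho |\alpha - \beta|$), which is the condition under which the lemma is usually stated. The set $A$ and the two test functions, the absolute value with $\rho = 1$ and a doubled ReLU with $\rho = 2$, are made-up choices.

```python
from itertools import product

def rademacher_complexity(vectors):
    """Exact R(A) for a finite set A in R^m, enumerating all sign vectors."""
    m = len(vectors[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in vectors)
    return total / (2 ** m * m)

def compose(phi, vectors):
    """phi o A: apply the same scalar function phi to every coordinate of every a in A."""
    return [tuple(phi(x) for x in vec) for vec in vectors]

# Hypothetical finite set A in R^3.
A = [(1.0, -2.0, 0.0), (-1.0, 1.0, 2.0), (0.5, 0.0, -1.0)]
base = rademacher_complexity(A)

abs_case = rademacher_complexity(compose(abs, A))                            # rho = 1
relu2_case = rademacher_complexity(compose(lambda x: 2 * max(0.0, x), A))    # rho = 2

print(abs_case <= 1 * base, relu2_case <= 2 * base)  # True True
```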

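Lemma 26.2

$$E_{S \sim D^m}[\mathrm{Rep}_D(F, S)] \le 2 E_{S \sim D^m} R(F \circ S)$$

Proof: The key is how to bring in the Rademacher random variables. Let $S' = (z'_1, \dots, z'_m)$ be another iid sample of size $m$ from $D$, so that $L_D(f) = E_{S'}[L_{S'}(f)]$:

$$\mathrm{Rep}_D(F, S) = \sup_{f \in F} (L_D(f) - L_S(f)) = \sup_{f \in F} (E_{S'}[L_{S'}(f)] - L_S(f)) = \sup_{f \in F} E_{S'}[L_{S'}(f) - L_S(f)] \le E_{S'}\left[\sup_{f \in F} (L_{S'}(f) - L_S(f))\right]$$

So:

$$E_{S \sim D^m}[\mathrm{Rep}_D(F, S)] \le E_{S, S'}\left[\sup_{f \in F} (L_{S'}(f) - L_S(f))\right] = \frac{1}{m} E_{S, S'}\left[\sup_{f \in F} \sum_{i=1}^{m} (f(z'_i) - f(z_i))\right] = \frac{1}{m} E_{S, S', \sigma}\left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i (f(z'_i) - f(z_i))\right]$$

We want to prove the last equality. Pick a specific index $j$ and look at its term:

$$E_{S, S' \sim D^m}\left[\sup_{f \in F} \sigma_j (f(z'_j) - f(z_j))\right] = \frac{1}{2} E_{S, S' \sim D^m}\left[\sup_{f \in F} (f(z'_j) - f(z_j))\right] + \frac{1}{2} E_{S, S' \sim D^m}\left[\sup_{f \in F} (f(z_j) - f(z'_j))\right] = E_{S, S' \sim D^m}\left[\sup_{f \in F} (f(z'_j) - f(z_j))\right]$$

The last step holds because $z_j$ and $z'_j$ are iid, so swapping them ($S \leftrightarrow S'$) does not change the expectation.

Then repeat this for every $z_j$, which proves the equality. Finally, since $-\sigma_i$ has the same distribution as $\sigma_i$:

$$\frac{1}{m} E_{S, S', \sigma}\left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i (f(z'_i) - f(z_i))\right] \le \frac{1}{m} E_{S', \sigma}\left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z'_i)\right] + \frac{1}{m} E_{S, \sigma}\left[\sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(z_i)\right] = E_{S' \sim D^m} R(F \circ S') + E_{S \sim D^m} R(F \circ S) = 2 E_{S \sim D^m} R(F \circ S)$$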

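Theorem 26.3 amounts to plugging the ERM output into the preceding lemma.

This once again demonstrates the power of ERM.

Theorem 26.5: two important bounds
Assume that for all $z$ and $h \in H$ we have $|\ell(h, z)| \le c$. Then:

  1. With probability at least $1 - \delta$, for all $h \in H$:

    $$L_D(h) - L_S(h) \le 2 E_{S' \sim D^m} R(\ell \circ H \circ S') + c \sqrt{\frac{2 \ln(2/\delta)}{m}}$$

  2. With probability at least $1 - \delta$, for all $h \in H$:

    $$L_D(h) - L_S(h) \le 2 R(\ell \circ H \circ S) + 4c \sqrt{\frac{2 \ln(4/\delta)}{m}}$$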
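To see how the two right-hand sides behave, the sketch below evaluates them for given values. The function names and the numbers (Rademacher complexity $0.1$, $c = 1$, $\delta = 0.05$) are placeholders for illustration; in practice the Rademacher term itself also shrinks as $m$ grows.

```python
import math

def bound_expected(expected_rademacher, c, delta, m):
    """Right-hand side of bound 1: 2 E[R] + c * sqrt(2 ln(2/delta) / m)."""
    return 2 * expected_rademacher + c * math.sqrt(2 * math.log(2 / delta) / m)

def bound_empirical(empirical_rademacher, c, delta, m):
    """Right-hand side of bound 2: 2 R + 4c * sqrt(2 ln(4/delta) / m)."""
    return 2 * empirical_rademacher + 4 * c * math.sqrt(2 * math.log(4 / delta) / m)

# Placeholder inputs: Rademacher complexity 0.1, losses bounded by c = 1.
for m in (100, 1000, 10000):
    print(m, bound_expected(0.1, 1.0, 0.05, m), bound_empirical(0.1, 1.0, 0.05, m))
```

The second bound pays a larger confidence term, but its Rademacher term is computable from the observed sample $S$ alone.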

Rademacher complexity of linear classes as an example

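Let $S = (x_1, x_2, \dots, x_m)$ be vectors in a Hilbert space. Define $H_2 \circ S = \{w \circ S = (\langle w, x_1 \rangle, \dots, \langle w, x_m \rangle) : \|w\|_2 \le C\}$. Then

$$R(H_2 \circ S) \le \frac{C \max_i \|x_i\|_2}{\sqrt{m}}$$

$$\begin{aligned}
m R(H_2 \circ S) &= E_{\sigma}\left[\sup_{w} \sum_{i=1}^{m} \sigma_i \langle w, x_i \rangle\right] = E_{\sigma}\left[\sup_{w} \left\langle w, \sum_{i=1}^{m} \sigma_i x_i \right\rangle\right] \\
&\le C E_{\sigma}\left[\left\| \sum_{i=1}^{m} \sigma_i x_i \right\|_2\right] \\
&\le C \left(E_{\sigma}\left[\left\| \sum_{i=1}^{m} \sigma_i x_i \right\|_2^2\right]\right)^{1/2} && \text{(Jensen's inequality)} \\
&= C \left(\sum_{i \ne j} \langle x_i, x_j \rangle E_{\sigma}[\sigma_i \sigma_j] + \sum_{i=1}^{m} \langle x_i, x_i \rangle E_{\sigma}[\sigma_i^2]\right)^{1/2} \\
&= C \left(\sum_{i=1}^{m} \|x_i\|_2^2\right)^{1/2} \le C \sqrt{m \max_i \|x_i\|_2^2}
\end{aligned}$$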
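For this linear class the supremum has a closed form, $\sup_{\|w\|_2 \le C} \langle w, v \rangle = C \|v\|_2$, so the Rademacher complexity can be computed exactly for small $m$ and compared with the bound. The three points in $\mathbb{R}^2$ and the radius $C = 2$ below are made-up values for illustration.

```python
import math
from itertools import product

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def linear_class_rademacher(xs, radius):
    """Exact R(H_2 o S) = (C/m) * E_sigma || sum_i sigma_i x_i ||_2, by enumeration."""
    m, dim = len(xs), len(xs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        v = [sum(s * x[d] for s, x in zip(sigma, xs)) for d in range(dim)]
        total += radius * norm(v)      # sup over ||w|| <= C of <w, v> equals C * ||v||
    return total / (2 ** m * m)

def linear_class_bound(xs, radius):
    """Upper bound C * max_i ||x_i||_2 / sqrt(m)."""
    return radius * max(norm(x) for x in xs) / math.sqrt(len(xs))

xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # hypothetical sample
exact = linear_class_rademacher(xs, radius=2.0)
bound = linear_class_bound(xs, radius=2.0)
print(exact <= bound)  # True
```

The bound depends on the norms of $w$ and $x_i$ but not on the dimension of the space, which is what makes it useful for high-dimensional linear predictors.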