Chapter 4: Supervised Learning

Acknowledgment: Most of the knowledge comes from Yuan Yang's course "Machine Learning".

Linear Regression

This is a very traditional statistics problem.
By least squares, $w = (X^\top X)^{-1} X^\top Y$. But the inverse is hard to compute.

We can use gradient descent. Define the square loss:

$$L(f, x_i, y_i) = \frac{1}{2}\big(f(x_i) - y_i\big)^2$$
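
A minimal sketch of gradient descent on this square loss (the toy data, step size, and iteration count are made-up assumptions for illustration):

```python
import numpy as np

# Toy data (assumption): y = X @ w_true + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.zeros(3)
eta = 0.1                                 # step size
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)     # gradient of (1/2N) sum_i (x_i^T w - y_i)^2
    w -= eta * grad

print(w)  # should be close to w_true and to the closed-form (X^T X)^{-1} X^T y
```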

What if the task is not regression but classification?

Consider binary classification first. A naive way is to output a sign and use positive/negative to represent the two classes. However, the sign function does not provide meaningful gradient information.

Approach 1: if there is no derivative, design an algorithm that does not need gradient information: the perceptron.

Approach 2: make the hard function soft and artificially create a derivative: logistic regression.

Perceptron

Intuition: adding $x$ to $w$ will make $w^\top x$ larger.

Limitation: it can only learn linear functions, so it does not converge if data is not linearly separable.

Update rule: adjust $w$ only when it makes a mistake. If $y > 0$ and $w^\top x < 0$:

$$w \leftarrow w + x$$

If $y < 0$ and $w^\top x > 0$:

$$w \leftarrow w - x$$

Combine these two cases into one: if $y_i\, w^\top x_i < 0$ (with $y_i \in \{-1, +1\}$),

$$w \leftarrow w + y_i x_i$$
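
A minimal sketch of this mistake-driven update rule (labels in $\{-1,+1\}$; the toy data is a made-up assumption and is taken to be linearly separable):

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Repeatedly apply w <- w + y_i x_i on mistakes until an epoch has no mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake (or on the boundary): update
                w += yi * xi
                mistakes += 1
        if mistakes == 0:            # separated the data: stop
            break
    return w

# Toy separable data (assumption for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)) + 2.0, rng.normal(size=(50, 2)) - 2.0])
y = np.array([1] * 50 + [-1] * 50)
print(perceptron(X, y))
```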

Convergence Proof:

Assume $\|w^*\| = 1$ and $\exists\,\gamma > 0$ s.t. $\forall i,\ y_i\langle w^*, x_i\rangle \ge \gamma$.
Also assume $\|x_i\| \le R$. Then the perceptron makes at most $\frac{R^2}{\gamma^2}$ mistakes.

$$\langle w_{t+1}, w^*\rangle = \langle w_t, w^*\rangle + y_t\langle x_t, w^*\rangle \ \ \text{(update rule)}\ \ \ge\ \langle w_t, w^*\rangle + \gamma \ \ \text{(margin assumption)}$$

Starting from $w_0 = 0$ and telescoping:

$$\langle w_{t+1}, w^*\rangle \ge t\gamma$$

Then

$$\|w_{t+1}\| = \|w_{t+1}\|\,\|w^*\| \ge \langle w_{t+1}, w^*\rangle \ge t\gamma$$

On the other hand:

$$\|w_{t+1}\|^2 = \|w_t + y_t x_t\|^2 = \|w_t\|^2 + \|y_t x_t\|^2 + 2y_t\langle x_t, w_t\rangle \le \|w_t\|^2 + \|y_t x_t\|^2 \ \ \text{(update only on mistakes)}\ \le \|w_t\|^2 + R^2 \ \ \text{($R$-condition)}$$

Telescoping:

$$\|w_{t+1}\|^2 \le tR^2$$

So

$$t^2\gamma^2 \le \|w_{t+1}\|^2 \le tR^2 \ \Rightarrow\ t \le \frac{R^2}{\gamma^2}$$

Logistic Regression

Turn the classification problem into a regression problem over probabilities.

Instead of using a sign function, we can output a probability. Here comes the important idea we already used in matrix completion: relaxation!

Make the hard function $\mathrm{sign}(z)$ soft:

$$\frac{1}{1 + e^{-z}}$$

It remains to define a loss function. The L1 and L2 losses are not good enough here, so we use the cross-entropy:

$$L(y, p) = -\sum_i y_i\log p_i$$

Explanation:

  • We already know the actual probability distribution y.

  • We measure the discrepancy between $p_i$ and $y_i$.
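
A minimal sketch of logistic regression trained by gradient descent on the cross-entropy loss above (binary labels in $\{0,1\}$; the toy data and step size are made-up assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (assumption): labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -1.0]) + 0.1 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(2)
eta = 0.5
for _ in range(1000):
    p = sigmoid(X @ w)               # soft "sign": predicted probabilities
    grad = X.T @ (p - y) / len(y)    # gradient of the cross-entropy loss
    w -= eta * grad

print(w)
```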

feature learning

Linear regression/classification can learn everything!

  • If the features are correct.

In general, linear regression is good enough, but feature learning is hard.

Deep learning is also called “representation learning” because it learns features automatically.

  • The last step of deep learning is always linear regression/classification!

Regularization

This is a trick to avoid overfitting.

Ridge regression

$$\min L = \frac{1}{2N}\sum_i(w^\top x_i - y_i)^2 + \frac{\lambda}{2}\|w\|_2^2$$
$$\nabla_w L = \frac{1}{N}\sum_i(w^\top x_i - y_i)x_i + \lambda w$$
$$H = \frac{1}{N}\sum_i x_i x_i^\top + \lambda I \succeq \lambda I$$

This is $\lambda$-strongly convex.

An intuitive illustration of how this works. For each gradient descent step, split it into two parts:

  • Part 1: Same as linear regression
  • Part 2: “Shrink” every coordinate by $(1 - \eta\lambda)$ (weight decay is a very important trick today)

Until the two parts “cancel out” and reach an equilibrium.
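
A minimal sketch of one ridge step written exactly as the two parts above (a plain linear-regression gradient step followed by shrinking every coordinate by $(1-\eta\lambda)$); the function name and parameters are illustrative assumptions:

```python
import numpy as np

def ridge_step(w, X, y, eta=0.1, lam=0.1):
    """One step on (1/2N)||Xw - y||^2 + (lambda/2)||w||^2, split into the two parts."""
    grad_data = X.T @ (X @ w - y) / len(y)   # Part 1: same as linear regression
    w = w - eta * grad_data
    w = (1.0 - eta * lam) * w                # Part 2: weight decay, shrink every coordinate
    return w
```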

However, ridge regression cannot find important features. It is essentially linear regression + weight decay.

Although the $w$ vector will have a smaller norm, every feature may get some (possibly very small) weight.

If we need to find important features, we need to optimize:

$$\min L = \frac{1}{2N}\sum_i(w^\top x_i - y_i)^2 \quad\text{subject to}\quad \|w\|_0 \le c$$

The constraint means we want at most $c$ non-zero entries, i.e., we keep only the $c$ most important features; all other features are treated as unimportant.

Then relax the "hard" constraint into a "soft" penalty, as we always do. This gives LASSO regression:

LASSO regression

$$\min L = \frac{1}{2N}\sum_i(w^\top x_i - y_i)^2 + \lambda\|w\|_1$$
$$\nabla_w L = \frac{1}{N}\sum_i(w^\top x_i - y_i)x_i + \lambda\,\mathrm{sign}(w)$$

An intuitive illustration of how this works. For each gradient descent step, split it into two parts:

  • Part 1: Same as linear regression
  • Part 2: for every coordinate $i$:

$$w_{t+1}(i) = \begin{cases}\hat{w}_t(i) - \eta\lambda, & \hat{w}_t(i) > \eta\lambda\\ \hat{w}_t(i) + \eta\lambda, & \hat{w}_t(i) < -\eta\lambda\\ 0, & \hat{w}_t(i)\in[-\eta\lambda, \eta\lambda]\end{cases}$$

where $\hat{w}_t$ denotes the iterate right after the Part 1 gradient step.

Until the two parts “cancel out” and reach an equilibrium.
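
A minimal sketch of the corresponding LASSO step (a plain gradient step followed by the coordinate-wise soft-thresholding rule above); the function name and parameters are illustrative assumptions:

```python
import numpy as np

def lasso_step(w, X, y, eta=0.1, lam=0.1):
    """One gradient step on the squared loss, then soft thresholding at eta * lam."""
    w_hat = w - eta * X.T @ (X @ w - y) / len(y)                          # Part 1: plain gradient step
    return np.sign(w_hat) * np.maximum(np.abs(w_hat) - eta * lam, 0.0)    # Part 2: soft threshold
```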

Compressed Sensing

Definition of RIP condition:

Let $W\in\mathbb{R}^{n\times d}$. $W$ is $(\epsilon, s)$-RIP if for all $x$ with $\|x\|_0 \le s$, we have

$$(1-\epsilon)\|x\|_2^2 \le \|Wx\|_2^2 \le (1+\epsilon)\|x\|_2^2$$

This is called the Restricted Isometry Property (RIP). It means $W$ (almost) does not stretch or shrink $x$ in any direction, as long as $x$ is sparse; it acts approximately like an isometry on the sparse vectors.

Without the sparsity condition, this is impossible when $n < d$: $W$ then has a non-trivial null space, so some nonzero vector is mapped to $0$.

Naïve application for RIP (Theorem 1)

Theorem: Let $\epsilon < 1$ and let $W$ be an $(\epsilon, 2s)$-RIP matrix. Let $x$ be a vector such that $\|x\|_0 \le s$, let $y = Wx$ be the compression of $x$, and let

$$\tilde{x} \in \operatorname*{argmin}_{v:\,Wv=y}\|v\|_0$$

be a reconstructed vector. Then $\tilde{x} = x$.

That is, $x$ is the only vector that produces $y$ under $W$ among all sufficiently sparse vectors.

Proof: by contradiction; note that the assumption is $(\epsilon, 2s)$-RIP.

Suppose not, i.e., $\tilde{x}\neq x$. Since $Wx = y$ (so $x$ is feasible), we know $\|\tilde{x}\|_0 \le \|x\|_0 \le s$, hence $\|x - \tilde{x}\|_0 \le 2s$. Applying the RIP condition to $x - \tilde{x}$ gives

$$(1-\epsilon)\|x-\tilde{x}\|_2^2 \le \|W(x-\tilde{x})\|_2^2 \le (1+\epsilon)\|x-\tilde{x}\|_2^2$$

Notice that $\|W(x-\tilde{x})\|_2^2 = 0$ while $\|x-\tilde{x}\|_2^2 \neq 0$, so $(1-\epsilon)\|x-\tilde{x}\|_2^2 \le 0$, a contradiction since $\epsilon < 1$.

This theorem still has the familiar problem: the 0-norm is hard to optimize, so we relax it to the 1-norm.

Theorem 2: from the 0-norm to the 1-norm

Let $W$ be an $(\epsilon, 2s)$-RIP matrix with $\epsilon < \frac{1}{1+\sqrt{2}}$. Let $x$ be a vector such that $\|x\|_0 \le s$ and let $y = Wx$ be the compression of $x$. Then

$$x = \operatorname*{argmin}_{v:\,Wv=y}\|v\|_0 = \operatorname*{argmin}_{v:\,Wv=y}\|v\|_1$$
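
A minimal sketch of the $\ell_1$ recovery in Theorem 2, written as a linear program with scipy (split $v = v^+ - v^-$ with $v^+, v^- \ge 0$); the problem sizes and the Gaussian choice of $W$ are made-up assumptions for illustration:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d, s = 40, 100, 3
W = rng.normal(size=(n, d)) / np.sqrt(n)                       # random measurement matrix
x = np.zeros(d)
x[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)   # s-sparse signal
y = W @ x

# min ||v||_1  s.t.  W v = y, with v = v_plus - v_minus and v_plus, v_minus >= 0
c = np.ones(2 * d)
A_eq = np.hstack([W, -W])
res = linprog(c, A_eq=A_eq, b_eq=y)     # default variable bounds are [0, inf)
v = res.x[:d] - res.x[d:]
print(np.max(np.abs(v - x)))            # should be (numerically) close to 0
```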

Theorem 2 + x contains noise (Theorem 3)

Let $\epsilon < \frac{1}{1+\sqrt{2}}$ and let $W$ be an $(\epsilon, 2s)$-RIP matrix.

Let $x$ be an arbitrary vector and denote $x_s \in \operatorname*{argmin}_{v:\,\|v\|_0\le s}\|x - v\|_1$.

That is, $x_s$ is the vector which equals $x$ on the $s$ largest (in absolute value) elements of $x$ and equals $0$ elsewhere.

Let $y = Wx$ be the compression of $x$ and let $x^* \in \operatorname*{argmin}_{v:\,Wv=y}\|v\|_1$ be the reconstructed vector. Then:

$$\|x^* - x\|_2 \le \frac{2(1+\rho)}{1-\rho}\,s^{-1/2}\,\|x - x_s\|_1$$

where $\rho = \frac{\sqrt{2}\,\epsilon}{1-\epsilon}$. When $x = x_s$ (i.e., $x$ is $s$-sparse), we get exact recovery $x^* = x$.

Proof:

Before the proof, we need to clarify some notation.

Given a vector $v$ and a set of indices $I$, denote by $v_I$ the vector whose $i$-th element is $v_i$ if $i\in I$ and $0$ otherwise.

We partition the indices as $[d] = T_0\cup T_1\cup\cdots\cup T_{d/s-1}$, with each $|T_i| = s$; assume $d/s$ is an integer for simplicity.

$T_0$ contains the indices of the $s$ largest elements (in absolute value) of $x$. Then $T_0^c = [d]\setminus T_0$.

$T_1$ contains the indices of the $s$ largest elements of $h_{T_0^c}$ (of $h$, not of $x$!). Let $T_{0,1} = T_0\cup T_1$ and $T_{0,1}^c = [d]\setminus T_{0,1}$.

$T_2$ contains the indices of the $s$ largest elements of $h_{T_{0,1}^c}$; $T_3, T_4, \ldots$ are constructed in the same way.

Based on this notation, $\|x - x_s\|_1 = \|x_{T_0^c}\|_1$, so the rest of the proof only needs to work with $\|x_{T_0^c}\|_1$.

Now we begin the proof.

Let $h = x^* - x$. We want to show that $\|h\|_2$ is small.

We split $h$ into two parts: $h = h_{T_{0,1}} + h_{T_{0,1}^c}$.

$h_{T_{0,1}}$ is $2s$-sparse and holds the large elements; we use RIP to bound it.

$h_{T_{0,1}^c}$ contains the remaining small entries; we bound it with the classic "sticks" intuition from Yuan Yang's lectures.

Step 1: show that $\|h_{T_{0,1}^c}\|_2 \le \|h_{T_0}\|_2 + 2s^{-1/2}\|x - x_s\|_1$.

First we bound $\|h_{T_{0,1}^c}\|_2$ in terms of $\|h_{T_0^c}\|_1$. Summing over $j = 2, 3, \ldots$ and using the triangle inequality, we have

$$\|h_{T_{0,1}^c}\|_2 \le \sum_{j\ge2}\|h_{T_j}\|_2 \ \text{(triangle inequality)} \ \le \sum_{j\ge2}\sqrt{s}\,\|h_{T_j}\|_\infty \ \text{(each $T_j$ has $s$ entries, so use the $\infty$-norm)} \ \le \sum_{j\ge2}\frac{\|h_{T_{j-1}}\|_1}{\sqrt{s}} \ (*) \ = \frac{\|h_{T_0^c}\|_1}{\sqrt{s}} \ \text{(definition of the 1-norm)}$$

Explanation of $(*)$: for any $j > 1$, any $i\in T_j$ and any $i'\in T_{j-1}$, $|h_i| \le |h_{i'}|$ (by the construction of the $T_j$). So

$$\|h_{T_j}\|_\infty \le \frac{\|h_{T_{j-1}}\|_1}{s}$$

Then we want to bound $\|h_{T_0^c}\|_1$.

$x^* = x + h$ has the minimal 1-norm among all vectors satisfying $Wv = y$, and $x$ itself satisfies $Wx = y$.

So:

$$\|x\|_1 \ge \|x + h\|_1 = \sum_{i\in T_0}|x_i + h_i| + \sum_{i\in T_0^c}|x_i + h_i| \ge \|x_{T_0}\|_1 - \|h_{T_0}\|_1 + \|h_{T_0^c}\|_1 - \|x_{T_0^c}\|_1 \quad\text{(triangle inequality for absolute values)}$$
$$\Rightarrow\ \|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1 + \|x_{T_0^c}\|_1 + \|x\|_1 - \|x_{T_0}\|_1 = \|h_{T_0}\|_1 + 2\|x_{T_0^c}\|_1 \quad\text{(the 1-norm splits additively over disjoint index sets)}$$
$$\le s^{1/2}\|h_{T_0}\|_2 + 2\|x_{T_0^c}\|_1 \quad\text{(Cauchy–Schwarz)}\ \le\ s^{1/2}\|h_{T_{0,1}}\|_2 + 2\|x_{T_0^c}\|_1 \quad\text{(since $T_0\subseteq T_{0,1}$)}$$

Since we will also bound $\|h_{T_{0,1}}\|_2$ shortly, loosening the bound to this form causes no problem.

To summarize, after Step 1 we have obtained:

$$\|h\|_2 \le \|h_{T_{0,1}}\|_2 + \|h_{T_{0,1}^c}\|_2 \ \text{(triangle inequality)} \ \le \|h_{T_{0,1}}\|_2 + s^{-1/2}\|h_{T_0^c}\|_1 \ \le \|h_{T_{0,1}}\|_2 + s^{-1/2}\big(s^{1/2}\|h_{T_{0,1}}\|_2 + 2\|x_{T_0^c}\|_1\big) \ = 2\|h_{T_{0,1}}\|_2 + 2s^{-1/2}\|x_{T_0^c}\|_1$$

Step 2: bound $\|h_{T_{0,1}}\|_2$.

Use the RIP condition to bound these large terms:

$$(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \|Wh_{T_{0,1}}\|_2^2 = \langle Wh_{T_{0,1}}, Wh_{T_{0,1}}\rangle = -\sum_{j\ge2}\langle Wh_{T_0} + Wh_{T_1}, Wh_{T_j}\rangle$$

The last equality holds because $Wh = 0$, so $Wh_{T_{0,1}} = Wh - \sum_{j\ge2}Wh_{T_j} = -\sum_{j\ge2}Wh_{T_j}$.

The final expression involves inner products between vectors supported on disjoint index sets. We use a very useful lemma:

Lemma: Let $W$ be $(\epsilon, 2s)$-RIP and let $I, J$ be disjoint index sets of size at most $s$. Then for any vector $u$ we have $\langle Wu_I, Wu_J\rangle \le \epsilon\,\|u_I\|_2\,\|u_J\|_2$.

Proof of this lemma:

WLOG assume $\|u_I\| = \|u_J\| = 1$.

$$\langle Wu_I, Wu_J\rangle = \frac{\|Wu_I + Wu_J\|^2 - \|Wu_I - Wu_J\|^2}{4} \quad\text{(polarization identity)}$$

We can bound these two terms by the RIP condition:

Since $|I\cup J| \le 2s$, we get from the RIP condition that (notice $\langle u_I, u_J\rangle = 0$)

$$\|Wu_I + Wu_J\|^2 = \|W(u_I + u_J)\|^2 \le (1+\epsilon)\big(\|u_I\|^2 + \|u_J\|^2\big) = 2(1+\epsilon)$$
$$\|Wu_I - Wu_J\|^2 = \|W(u_I - u_J)\|^2 \ge (1-\epsilon)\big(\|u_I\|^2 + \|u_J\|^2\big) = 2(1-\epsilon)$$

Plugging these in gives $\langle Wu_I, Wu_J\rangle \le \frac{2(1+\epsilon) - 2(1-\epsilon)}{4} = \epsilon$.

Now come back to the original proof.

Therefore, for $i\in\{0,1\}$ and $j\ge2$: $|\langle Wh_{T_i}, Wh_{T_j}\rangle| \le \epsilon\,\|h_{T_i}\|_2\,\|h_{T_j}\|_2$.

Continue:

$$(1-\epsilon)\|h_{T_{0,1}}\|_2^2 \le \Big|\sum_{j\ge2}\langle Wh_{T_0} + Wh_{T_1}, Wh_{T_j}\rangle\Big| \le \epsilon\sum_{j\ge2}\big(\|h_{T_0}\|_2 + \|h_{T_1}\|_2\big)\|h_{T_j}\|_2 \quad\text{(by the lemma above)}$$
$$\le \sqrt{2}\,\epsilon\,\|h_{T_{0,1}}\|_2\sum_{j\ge2}\|h_{T_j}\|_2 \quad\text{(since $a + b \le \sqrt{2}\sqrt{a^2+b^2}$)} \ \le\ \sqrt{2}\,\epsilon\,\|h_{T_{0,1}}\|_2\,\frac{\|h_{T_0^c}\|_1}{\sqrt{s}} \quad\text{(proved above)}$$
$$\Rightarrow\ \|h_{T_{0,1}}\|_2 \le \frac{\sqrt{2}\,\epsilon}{1-\epsilon}\,s^{-1/2}\,\|h_{T_0^c}\|_1$$

We already bounded $\|h_{T_0^c}\|_1$, but that bound was in terms of $\|h_{T_{0,1}}\|_2$, and now $\|h_{T_{0,1}}\|_2$ is bounded in terms of $\|h_{T_0^c}\|_1$; the two quantities bound each other, which looks odd, but substituting one inequality into the other resolves it.

Continue: let $\rho = \frac{\sqrt{2}\,\epsilon}{1-\epsilon}$.

$$\|h_{T_{0,1}}\|_2 \le \rho\,s^{-1/2}\|h_{T_0^c}\|_1 \le \rho\|h_{T_{0,1}}\|_2 + 2\rho\,s^{-1/2}\|x_{T_0^c}\|_1 \ \Rightarrow\ \|h_{T_{0,1}}\|_2 \le \frac{2s^{-1/2}\rho}{1-\rho}\|x_{T_0^c}\|_1$$

Then substitute this back into the earlier inequality:

$$\|h\|_2 \le 2\|h_{T_{0,1}}\|_2 + 2s^{-1/2}\|x_{T_0^c}\|_1 \le 2s^{-1/2}\Big(\frac{2\rho}{1-\rho} + 1\Big)\|x_{T_0^c}\|_1 = \frac{2s^{-1/2}(1+\rho)}{1-\rho}\|x_{T_0^c}\|_1$$

This completes the proof.

How to construct an RIP matrix?

Random matrix

Theorem 4: Let $U$ be an arbitrary fixed $d\times d$ orthonormal matrix and let $\epsilon, \delta$ be scalars in $(0,1)$. Let $s$ be an integer in $[d]$ and let $n$ be an integer that satisfies

$$n \ge 100\,\frac{s\,\log\!\big(\frac{40d}{\delta\epsilon}\big)}{\epsilon^2}$$

Let $W\in\mathbb{R}^{n\times d}$ be a matrix such that each element of $W$ is distributed normally with zero mean and variance $1/n$. Then, with probability at least $1-\delta$ over the choice of $W$, the matrix $WU$ is $(\epsilon, s)$-RIP.

Explanation:

  • The matrix $U$ can be the identity, so $W$ itself is also $(\epsilon, s)$-RIP.

  • Before we delve deeper into the proof, let us ask: why is $WU$ useful?

    • In $y = Wx$, $W$ is RIP, but $x$ cannot always be sparse.
    • But if $x = Ua$, then $y = (WU)a$: $WU$ is still RIP, $U$ is orthonormal, and $a$ is sparse. That is, we can insert an arbitrary orthonormal matrix to turn a non-sparse input into a sparse representation.
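
As a quick numerical sanity check of Theorem 4 (with $U$ the identity), one can draw a Gaussian $W$ with variance $1/n$ and see how far $\|Wx\|^2/\|x\|^2$ strays from 1 on random sparse vectors. The sizes below are arbitrary illustrative choices, and testing random sparse vectors is of course weaker than the "for all" in the RIP definition:

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, n = 200, 5, 60                                    # n chosen for illustration, not by the theorem's bound
W = rng.normal(scale=1.0 / np.sqrt(n), size=(n, d))     # entries ~ N(0, 1/n)

worst = 0.0
for _ in range(1000):
    x = np.zeros(d)
    x[rng.choice(d, size=s, replace=False)] = rng.normal(size=s)   # random s-sparse vector
    ratio = np.linalg.norm(W @ x) ** 2 / np.linalg.norm(x) ** 2
    worst = max(worst, abs(ratio - 1.0))

print(worst)   # an empirical stand-in for epsilon on these sparse directions
```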

The rough idea of this proof is:

The RIP condition must hold for all (infinitely many) sparse vectors, but we want to apply a union bound. So:

  • Continuous space → finite number of points.

  • Consider a specific index set $I$ of size $s$ (get the RIP-type condition on this specific index set).

  • Use this index set to enter the sparse space.

  • Apply a union bound over all possible $I$.

Let's begin our proof:

Lemma 2:

Let $\epsilon\in(0,1)$. There exists a finite set $Q\subset\mathbb{R}^d$ of size $|Q| \le \big(\frac{5}{\epsilon}\big)^d$ such that:

$$\sup_{x:\,\|x\|\le1}\ \min_{v\in Q}\ \|x - v\| \le \epsilon$$

Meaning:

We can cover a high-dimensional ball with finitely many points (here "cover" means that any point of the ball is within a small distance of our finite point set). This corresponds to the first step of the idea: "continuous space → finite number of points".

Proof:

Assume $\epsilon = 1/k$ (rounding $k$ up to an integer), and let

$$Q' = \{x\in\mathbb{R}^d : \forall j,\ \exists i\in\{-k, -k+1, \ldots, k\}\ \text{s.t.}\ x_j = i/k\}$$

That is, every coordinate of every point in $Q'$ lies on one of the $2k+1$ grid values in $[-1,1]$. Then any point of the ball lies in some grid cell, and its distance to a vertex of that cell is at most $\frac{1}{k} \le \epsilon$.

Clearly, $|Q'| = (2k+1)^d$ (each coordinate has $2k+1$ possible values).

We then set $Q = Q'\cap B_2(1)$, where $B_2(1)$ is the unit $\ell_2$ ball of $\mathbb{R}^d$. Some further counting then yields the bound $|Q| \le \big(\frac{5}{\epsilon}\big)^d$.

JL lemma:

Let $Q$ be a finite set of vectors in $\mathbb{R}^d$. Let $\delta\in(0,1)$ and let $n$ be an integer such that

$$\epsilon = \sqrt{\frac{6\ln(2|Q|/\delta)}{n}} \le 3$$

equivalently,

$$n = \frac{6\ln(2|Q|/\delta)}{\epsilon^2}$$

Then, with probability at least $1-\delta$ over the choice of a random matrix $W\in\mathbb{R}^{n\times d}$ whose elements are independently distributed according to $N(0, 1/n)$, we have:

$$\sup_{x\in Q}\left|\frac{\|Wx\|^2}{\|x\|^2} - 1\right| < \epsilon$$

Meaning:

  • This lemma is already almost identical to the RIP condition we want to prove; the only difference is that $x$ is not sparse but ranges over a finite set. This corresponds to the second step of the idea: "consider a specific index set $I$ of size $s$".

  • The required $n$ no longer depends on the ambient dimension $d$, only on the size of the finite set $Q$, so we can work with a fairly small $Q$ and a low-dimensional $n\times d$ matrix $W$.

Proof: omitted here.

Lemma 3:

Let $U$ be an orthonormal $d\times d$ matrix and let $I\subset[d]$ be a set of indices of size $|I| = s$. Let $S$ be the span of $\{U_i : i\in I\}$, where $U_i$ is the $i$-th column of $U$. Let $\delta\in(0,1)$, $\epsilon\in(0,1)$, and let $n$ be an integer such that

$$n \ge 24\,\frac{\ln\!\big(\frac{2}{\delta}\big) + s\ln\!\big(\frac{20}{\epsilon}\big)}{\epsilon^2}$$

Then, with probability at least $1-\delta$ over the choice of a random matrix $W\in\mathbb{R}^{n\times d}$ whose elements are independently distributed according to $N(0, 1/n)$, we have:

$$\sup_{x\in S}\left|\frac{\|Wx\|}{\|x\|} - 1\right| < \epsilon$$

Meaning:

Lemma 3 is already very close to what we finally want to prove; it corresponds to the third step of the idea: "use this index set to enter the sparse space".

First, an explanation: $S$ is the subspace spanned by the columns of $U$ indexed by $I$, so each element $x\in S$ is a linear combination of those columns and we can write $x = U_I a$ where $a\in\mathbb{R}^s$.

If Lemma 3 is true, we know that for any $x\in S$,

$$(1-\epsilon) \le \frac{\|Wx\|}{\|x\|} \le (1+\epsilon)$$

which implies that

$$(1-3\epsilon) \le \frac{\|Wx\|^2}{\|x\|^2} \le (1+3\epsilon) \ \ \text{(squaring both sides)},\qquad (1-3\epsilon) \le \frac{\|WU_I a\|^2}{\|U_I a\|^2} \le (1+3\epsilon) \ \ (x = U_I a),\qquad (1-3\epsilon) \le \frac{\|WU_I a\|^2}{\|a\|^2} \le (1+3\epsilon)$$

Since $a$ has unit length (and $\|U_I a\| = \|a\|$), we only need to bound $\|WU_I a\|$, i.e., $\|Wx\|$.

Once Lemma 3 holds, we do the fourth step, "apply a union bound over all possible $I$" (there are $\binom{d}{s}$ of them), i.e., set $\delta' = \frac{\delta}{\binom{d}{s}}$, and the theorem is proved.

Proof:

The rough idea is clear: first use the JL lemma to bound the norm $\|WU_I v\|$ at the discrete points $v$, then use Lemma 2 to bound the gap $a - v$ between an arbitrary point and the discrete set, i.e., to bound $\|WU_I(a - v)\|$; together these bound $\|WU_I a\|$.

It suffices to prove the lemma for all $x\in S$ of unit norm.

We can write $x = U_I a$ where $a\in\mathbb{R}^s$, $\|a\|_2 = 1$, and $U_I$ is the matrix whose columns are $\{U_i : i\in I\}$. A quick explanation: as noted above, $x$ is a linear combination of these columns, hence $x = U_I a$; and since $x$ has unit length and the columns of $U$ are orthonormal, $a$ also has unit length.

Using Lemma 2, we know that there exists a set $Q$ of size $|Q| \le \big(\frac{20}{\epsilon}\big)^s$ such that:

$$\sup_{a:\,\|a\|=1}\ \min_{v\in Q}\ \|a - v\| \le \epsilon/4$$

Since $U$ is orthonormal, we also have

$$\sup_{a:\,\|a\|=1}\ \min_{v\in Q}\ \|U_I a - U_I v\| \le \epsilon/4$$

That is:

$$\sup_{x\in S,\,\|x\|=1}\ \min_{v\in Q}\ \|x - U_I v\| \le \epsilon/4$$

Applying the JL lemma to $\{U_I v : v\in Q\}$, we know that for

$$n \ge \frac{4\cdot6\ln(2|Q|/\delta)}{\epsilon^2} = 24\,\frac{\ln\!\big(\frac{2}{\delta}\big) + s\ln\!\big(\frac{20}{\epsilon}\big)}{\epsilon^2}$$

we have:

$$\sup_{v\in Q}\left|\frac{\|WU_I v\|^2}{\|U_I v\|^2} - 1\right| \le \epsilon/2$$

This also implies

$$\sup_{v\in Q}\left|\frac{\|WU_I v\|}{\|U_I v\|} - 1\right| \le \epsilon/2$$

Let $a$ be the smallest value such that $\|Wx\| \le 1 + a$ for all $x\in S$ of unit norm; clearly $a < \infty$ (the supremum exists, so we can define it and then bound it). We want to show $a < \epsilon$. This gives the right half of the RIP condition.

For every $x\in S$ of unit norm, there exists $v\in Q$ such that $\|x - U_I v\| \le \epsilon/4$, and therefore

$$\|Wx\| \le \|WU_I v\| + \|W(x - U_I v)\| \le (1+\epsilon/2)\|U_I v\| + \|W(x - U_I v)\| \ \ \text{(JL lemma applied to the discrete point set of this index set)}$$
$$\le (1+\epsilon/2)\|U_I v\| + (1+a)\,\epsilon/4 \ \ \text{(Lemma 2 together with the definition of $a$, since $x - U_I v\in S$ has norm at most $\epsilon/4$)}$$

Notice that $\|U_I v\| \le 1$, because $U_I$ does not change lengths and $v\in Q$ lies inside the unit ball. So

$$\|Wx\| \le 1 + \frac{\epsilon}{2} + (1+a)\frac{\epsilon}{4} \ \Rightarrow\ \frac{\|Wx\|}{\|x\|} \le 1 + \frac{\epsilon}{2} + (1+a)\frac{\epsilon}{4}$$

By the definition of $a$, we know

$$a \le \frac{\epsilon}{2} + (1+a)\frac{\epsilon}{4} \ \Rightarrow\ a \le \frac{\frac{3\epsilon}{4}}{1 - \frac{\epsilon}{4}} \le \epsilon$$

There may be some problem with the left-half argument in the slides. TODO

There may be a gap between Lemma 3 and the theorem, but it was not covered in class.

The homework also mentions constructing an RIP matrix from a random orthonormal basis; this will reappear later in the decision tree section.

Support Vector Machine

Concept

A support vector machine (SVM) is a binary classification model; its basic form is the linear classifier in feature space that maximizes the margin.

Margin: distance from the separator to the closest point

Samples on the margin are "support vectors".

Mathematical formulation:

Find a hyperplane $w^\top x + b = 0$ s.t.

$$\text{for } y_i = 1:\ w^\top x_i + b \ge 1,\qquad \text{for } y_i = -1:\ w^\top x_i + b \le -1,\qquad \max\ \frac{1}{\|w\|_2}$$

If the data is perfectly linearly separable, this can be solved by quadratic programming in polynomial time.

What if the data is not linearly separable?

Naïve answer:

$$\min\ \|w\|^2 + \lambda\cdot\#\{\text{mistakes}\}$$

i.e.

$$\min\ \|w\|^2 + \lambda\sum_i \mathbb{1}\{y_i(w^\top x_i) < 1\}$$

This is NP-Hard to minimize.

The indicator function is hard to optimize. Make it soft.

Relax the hard constraint:

$$\min\ \|w\|^2 + \lambda\sum_i\xi_i \qquad \text{s.t.}\ \forall i,\ y_i\,w^\top x_i \ge 1 - \xi_i,\ \xi_i \ge 0$$

$\xi_i$ is a "slack variable".

"Violating the constraint" a little bit is OK. SVM with "Soft" margin.

This is called SVM with a "Soft" margin.

Computation

Then we use the dual to solve the SVM.

The linearly separable case:

Primal:

$$\min\ \frac{\|w\|_2^2}{2} \qquad \text{s.t.}\ \forall i,\ y_i\,w^\top x_i \ge 1$$

Dual:

$$L(w, \alpha) = \frac{\|w\|_2^2}{2} - \sum_i\alpha_i\big(y_i\,w^\top x_i - 1\big),\qquad \forall i,\ \alpha_i \ge 0$$

Take the derivative:

$$\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_i\alpha_i y_i x_i$$

Plugging this into $L(w, \alpha)$, we get:

$$L(w, \alpha) = \sum_i\alpha_i - \frac{1}{2}\sum_i\sum_j y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle$$

The relaxed case

Primal:

$$\min\ \frac{1}{2}\|w\|_2^2 + \lambda\sum_i\xi_i \qquad \text{s.t.}\ \forall i,\ y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$$

The Lagrangian:

$$L(w, b, \xi, a, \kappa) = \frac{1}{2}\|w\|_2^2 + \lambda\sum_i\xi_i - \sum_i a_i\big(y_i(w^\top x_i + b) - 1 + \xi_i\big) - \sum_i\kappa_i\xi_i$$

where $a_i \ge 0$, $\kappa_i \ge 0$.

Taking derivatives, we get the optimality conditions:

$$\frac{\partial L}{\partial w} = w - \sum_i a_i y_i x_i = 0,\qquad \frac{\partial L}{\partial b} = -\sum_i y_i a_i = 0,\qquad \frac{\partial L}{\partial\xi_i} = \lambda - a_i - \kappa_i = 0$$

So $a_i = \lambda - \kappa_i \le \lambda$. This is one more constraint than in the linearly separable case.

Then we plug these conditions back into $L(w, b, \xi, a, \kappa)$:

$$\begin{aligned}L(w, b, \xi, a, \kappa) &= \frac{1}{2}\sum_i\sum_j y_i y_j a_i a_j\langle x_i, x_j\rangle + \sum_i(\lambda - a_i - \kappa_i)\xi_i - \sum_i a_i y_i\,w^\top x_i - b\sum_i a_i y_i\ (=0) + \sum_i a_i\\ &= \frac{1}{2}\sum_i\sum_j y_i y_j a_i a_j\langle x_i, x_j\rangle - \sum_i\sum_j y_i y_j a_i a_j\langle x_i, x_j\rangle + \sum_i a_i\\ &= \sum_i a_i - \frac{1}{2}\sum_i\sum_j y_i y_j a_i a_j\langle x_i, x_j\rangle\end{aligned}$$

The final form is the same as in the linearly separable case.

kernel trick

We can use a feature map to transform non-linearly separable data into a high-dimensional space in which the data becomes linearly separable.

Two problems:

  • The feature map $\phi(x)$ may have a very high dimension, which is hard to compute.

  • The separator lives in that high-dimensional space, which is also hard to compute.

The inner product $\langle\phi(x_1), \phi(x_2)\rangle$ can often be computed directly from $x_1, x_2$, e.g. the quadratic kernel from the slides, or the Gaussian kernel we commonly use.

$$\max\ \sum_i a_i - \frac{1}{2}\sum_i\sum_j y_i y_j a_i a_j\langle\phi(x_i), \phi(x_j)\rangle$$

  • We only need $n^2$ kernel computations $\langle\phi(x_i), \phi(x_j)\rangle$.

When a new data point $x$ arrives, what do we do? We do not compute $\phi(x)$ -- that can be expensive!

  • $w^\top\phi(x) = \sum_i a_i y_i\langle\phi(x_i), \phi(x)\rangle$.
    Assuming $r$ support vectors, we only need at most $r$ kernel computations.

Therefore, no need for computing ϕ(x)!
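
A minimal sketch of the kernel computations just described: a Gaussian (RBF) kernel, the $n^2$ kernel matrix used in the dual objective, and a prediction that only touches the support vectors. The dual variables $a_i$ are assumed to come from a QP solver that is not shown here:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian kernel <phi(x1), phi(x2)> computed directly from x1 and x2."""
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_matrix(X, gamma=1.0):
    """The n^2 kernel evaluations K[i, j] = <phi(x_i), phi(x_j)> for the dual objective."""
    n = len(X)
    return np.array([[rbf_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)])

def predict(x_new, X, y, a, b=0.0, gamma=1.0):
    """sign(w^T phi(x) + b), with w^T phi(x) = sum_i a_i y_i <phi(x_i), phi(x)>.
    Only support vectors (a_i > 0) contribute, so at most r kernel computations."""
    sv = a > 1e-8
    score = sum(a_i * y_i * rbf_kernel(x_i, x_new, gamma)
                for a_i, y_i, x_i in zip(a[sv], y[sv], X[sv]))
    return np.sign(score + b)
```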

Going a step further: do we even need to define $\phi(x)$?

Mercer's Theorem: If the kernel matrix is positive semidefinite for any $\{x_i\}$, then there exists some $\phi(\cdot)$ such that $K$ is the kernel of $\phi(\cdot)$.

That means we may define a kernel function $K$ for the SVM without ever knowing the exact form of $\phi(\cdot)$!

Decision Tree

Pro: good interpretability.

Con: trees can become big and deep, are hard to adjust, and overfit easily.

Preliminaries

Boolean function analysis: $f: \{-1, 1\}^n \to [0, 1]$.

Fourier basis of Boolean functions: $\chi_S(x) = \prod_{i\in S}x_i$, for $S\subseteq[n]$.

$$f(x) = \sum_S\hat{f}_S\,\chi_S(x),\qquad \hat{f}_S = \langle f, \chi_S\rangle = \mathbb{E}_{x\sim D}[f(x)\chi_S(x)]$$

$$L_1(f) = \sum_S|\hat{f}_S|$$

$$L_0(f) = |\{S : \hat{f}_S \neq 0\}|\qquad\text{(the sparsity of $f$ is exactly $L_0(f)$)}$$

Low degree means the degree of all terms is bounded by that number, i.e., $|S| \le \text{degree}$ for every $S\subseteq[n]$ with $\hat{f}_S \neq 0$.

Convert decision tree to low-degree sparse function

Theorem: for any decision tree $T$ with $s$ leaf nodes, there exists a degree-$\log\frac{s}{\epsilon}$, sparsity-$\frac{s^2}{\epsilon}$ function $h$ that $4\epsilon$-approximates $T$.

Proof:

Step 1: bound the depth. Truncate $T$ at depth $\log\frac{s}{\epsilon}$.

After truncation there are at most $\frac{s}{\epsilon}$ nodes at the cutoff depth (this maximum is reached only when the tree is a full binary tree).

The truncated tree differs from $T$ by at most $\frac{\epsilon}{s}\times s = \epsilon$. Why? Because the probability of reaching any particular node at the cutoff depth is $\frac{\epsilon}{s}$, and at most $s$ such nodes have leaves below them that get cut off.

(Note: this step already introduces an error of $\epsilon$.)

So below we assume $T$ has depth at most $\log\frac{s}{\epsilon}$.

Step 2: bound the degree and the $L_1$ norm.

A tree with $s$ leaf nodes can be represented as a sum of $s$ "AND" terms.

Why? To reach a particular leaf, a specific set of branch conditions must hold; for example, if reaching a leaf requires $x_1 = 1$ and $x_2 = 1$, then that leaf can be written as the AND term $x_1\wedge x_2$.

Every leaf can be represented this way: $\bigwedge_{i\in S}(x_i = \text{flag}_i)$, with $\text{flag}_i\in\{-1, 1\}$.

The Fourier coefficients of such a term can be computed from the formula $\hat{f}_S = \langle f, \chi_S\rangle = \mathbb{E}_{x\sim D}[f(x)\chi_S(x)]$. For instance, the computation gives exactly

$$x_1\wedge x_2 = \frac{1}{4} + \frac{1}{4}x_1 + \frac{1}{4}x_2 + \frac{1}{4}x_1x_2$$

Because every AND term has $L_1 = 1$ and involves at most $\log\frac{s}{\epsilon}$ variables (the depth is at most $\log\frac{s}{\epsilon}$), we get $L_1(f) \le s$ and the degree of $f$ is bounded by $\log\frac{s}{\epsilon}$.
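
A quick numerical check of the expansion above: treat AND$(x_1, x_2)$ as a $\{0,1\}$-valued function on $\{-1,1\}^2$ and compute each Fourier coefficient as the average of $f(x)\chi_S(x)$ over all inputs:

```python
import itertools
import numpy as np

def AND(x):
    # AND(x1, x2) as a {0,1}-valued function on {-1, 1}^2 (+1 means "true")
    return 1.0 if x[0] == 1 and x[1] == 1 else 0.0

inputs = list(itertools.product([-1, 1], repeat=2))
for S in [(), (0,), (1,), (0, 1)]:
    # hat{f}_S = E_x[f(x) * chi_S(x)] under the uniform distribution
    coef = np.mean([AND(x) * np.prod([x[i] for i in S]) for x in inputs])
    print(S, coef)   # each prints 0.25, matching 1/4 + x1/4 + x2/4 + x1x2/4
```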

Step 3

We need to prove:

For any $f$ such that $L_1(f) \le s$ with degree at most $\log\frac{s}{\epsilon}$, there is a function $h$ with $L_0(h) \le \frac{s^2}{\epsilon}$ such that $\mathbb{E}[(f - h)^2] \le \epsilon$.

Proof:

Let $h$ consist of all terms of $f$ with $|\hat{f}_S| \ge \frac{\epsilon}{L_1(f)}$. The degree condition is then automatically satisfied.

There are at most $L_1(f)\big/\frac{\epsilon}{L_1(f)} = \frac{L_1(f)^2}{\epsilon} \le \frac{s^2}{\epsilon}$ such terms, so the sparsity condition is satisfied as well.

By Parseval's identity, the missing terms contribute at most:

$$\mathbb{E}[(f - h)^2] = \sum_{S\notin h}(\hat{f}_S)^2 \le \max_{S\notin h}|\hat{f}_S|\sum_{S\notin h}|\hat{f}_S| \le \frac{\epsilon}{L_1(f)}\times L_1(f) = \epsilon$$

So the error requirement is also satisfied.

Combining the errors of the two steps (truncation and dropping small terms) completes the proof.

Now we have turned the problem of learning a decision tree into the problem of learning a function that is low-degree and sparse in the Boolean Fourier basis. How do we find the coefficients of that function in this basis?

Theoretical analysis

KM algorithm: (Not required)

Key point: recursively prune the less promising sets of basis functions and explore the promising ones.

$f_\alpha$ denotes the sum of all Fourier terms whose index starts with the prefix $\alpha$.

Take a function of the three variables $x_1, x_2, x_3$ as an example:

$$f_1 = \hat{f}_{x_1x_2x_3}x_2x_3 + \hat{f}_{x_1x_2}x_2 + \hat{f}_{x_1x_3}x_3 + \hat{f}_{x_1}\quad\text{(the terms containing $x_1$, with the factor $x_1$ stripped off)}$$
$$f_0 = \hat{f}_{x_2x_3}x_2x_3 + \hat{f}_{x_2}x_2 + \hat{f}_{x_3}x_3 + \hat{f}_\emptyset\quad\text{(the terms not containing $x_1$)}$$

All these functions are well defined and also satisfy Parseval's identity.

e.g.

$$\mathbb{E}[f_{11}^2] = \hat{f}_{x_1x_2x_3}^2 + \hat{f}_{x_1x_2}^2$$

Algorithm process:

    def Coef(a):
        if E[f_a^2] >= theta^2:
            if |a| == n: output a                 # the prefix cannot be longer than n
            else: Coef(a + "0"); Coef(a + "1")    # this feels rather brute-force

Roughly, we only keep prefixes whose coefficient mass is larger than the threshold, extending them as far as possible while they remain promising and not letting them grow too long.

Of course there is an obvious question: how do we compute $\mathbb{E}[f_a^2]$?

Lemma 3.2:

For any function $f$, any $1 \le k < n$, any $\alpha\in\{0,1\}^k$, and any $x\in\{-1,1\}^{n-k}$, $f_\alpha(x) = \mathbb{E}_{y\sim\{-1,1\}^k}[f(yx)\chi_\alpha(y)]$.

Roughly speaking: the first $k$ Boolean variables (the part the prefix $\alpha$ refers to) are drawn uniformly over all possibilities and averaged out, while the remaining variables are fixed to $x$.

This formulation implies that even though we do not know how to compute the value of $f_\alpha(x)$ exactly, we can approximate it by approximating the above expectation.

This algorithm has two problems:

  • Pretty slow
  • Sequential algorithm, cannot be done in parallel

LMN Algorithm:

For every $S\subseteq[n]$ with low degree, we estimate $\hat{f}_S$ using $m$ samples:

$$\hat{f}_S = \frac{1}{m}\sum_i f(x_i)\chi_S(x_i)$$

Do this for all such $S$ to get the function.

The sample complexity is small, and the algorithm is parallelizable.
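
A minimal sketch of the LMN estimator: for every $S$ up to a degree bound, estimate $\hat{f}_S$ by averaging $f(x_i)\chi_S(x_i)$ over $m$ uniform samples. The target function here is a small made-up example:

```python
import itertools
import numpy as np

def chi(S, x):
    """Fourier basis chi_S(x) = prod_{i in S} x_i for x in {-1, 1}^n."""
    return np.prod(x[list(S)]) if S else 1.0

def lmn_estimate(f, n, degree, m, rng):
    """Estimate hat{f}_S for all |S| <= degree from m uniform samples (trivially parallel over S)."""
    X = rng.choice([-1, 1], size=(m, n))
    fx = np.array([f(x) for x in X])
    coeffs = {}
    for k in range(degree + 1):
        for S in itertools.combinations(range(n), k):
            coeffs[S] = np.mean(fx * np.array([chi(S, x) for x in X]))
    return coeffs

# Example target (assumption): a tiny decision-tree-like function
f = lambda x: 1.0 if (x[0] == 1 and x[2] == 1) else 0.0
rng = np.random.default_rng(0)
print(lmn_estimate(f, n=5, degree=2, m=2000, rng=rng))
```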

Two problems:

  • does not work well in practice
  • does not have guarantees in the noisy setting.

From the point of view of compressed sensing, we can control the error, because the Fourier basis $\{\chi_S\}$ is a random orthonormal family.

Harmonica: compressed sensing

$$y = Ax$$

$x$ is the vector of coefficients of the $\chi_S$, which is sparse in our case.

$y$ collects the values of the whole function at different inputs: $y_i = f(x_i)$.

$A$: $a_{ij} = \chi_{S_j}(x_i)$; the $i$-th row corresponds to the $i$-th input, and the $j$-th column to the $j$-th basis function.

This matrix built from Boolean function analysis behaves like a random orthonormal matrix, so we can run compressed sensing to recover $x$.

In practice, how to build a decision tree? Gini index

At a specific node, we have a set of candidate variables left to split on, and we need to decide which variable the next node should test.

We use the Gini index to measure the uncertainty of a variable relative to the decision outcome.

We pick the variable with the smallest Gini index/uncertainty.

$$\mathrm{Gini}(v) = \Pr[v=0]\,\mathrm{Gini}(v=0) + \Pr[v=1]\,\mathrm{Gini}(v=1)$$
$$\mathrm{Gini}(v=0) = 1 - \Pr[\text{result}=0\mid v=0]^2 - \Pr[\text{result}=1\mid v=0]^2$$
$$\mathrm{Gini}(v=1) = 1 - \Pr[\text{result}=0\mid v=1]^2 - \Pr[\text{result}=1\mid v=1]^2$$
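
A minimal sketch of these formulas for a binary split variable against a binary outcome (inputs are 0/1 arrays; an empty branch is treated as contributing zero):

```python
import numpy as np

def gini_of_branch(result):
    """Gini(v=b) = 1 - Pr[result=0 | v=b]^2 - Pr[result=1 | v=b]^2 for one branch."""
    if len(result) == 0:
        return 0.0
    p1 = np.mean(result)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_index(v, result):
    """Gini(v) = Pr[v=0] * Gini(v=0) + Pr[v=1] * Gini(v=1)."""
    v, result = np.asarray(v), np.asarray(result)
    p1 = np.mean(v)
    return (1.0 - p1) * gini_of_branch(result[v == 0]) + p1 * gini_of_branch(result[v == 1])

# Pick the variable with the smallest Gini index, e.g.:
# best = min(range(X.shape[1]), key=lambda j: gini_index(X[:, j], y))
```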

In theory, if the decision tree is very deep, it can fit anything, but then it overfits. How do we avoid that? Random forests.

Random forest (one of the best tools in practice)

Bagging

  • Bagging is the best-known representative of parallel ensemble learning methods. It is based directly on bootstrap sampling.

Random forest is an extended variant of bagging.

According to the slides: sample $n$ times with replacement from the original training set to build a set of the same size that may contain repeated elements; this is effectively a weighted subset. Then run the decision-tree construction algorithm on it.

Repeat this $B$ times to get $B$ trees, then average them.

This makes the resulting trees more stable.

This is only data bagging. We can also do feature bagging: exclude some features when constructing each tree, which helps further.

• Each tree can only use a random subset of features.

• So the decision tree is forced to use all available information to predict, instead of relying on a few powerful features.
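
A minimal sketch of data bagging plus feature bagging as described above, using scikit-learn's DecisionTreeClassifier as the base learner (the use of scikit-learn, binary 0/1 labels, and the parameter choices are my own assumptions, not part of the notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, B=50, n_features=None, rng=None):
    """Train B trees, each on a bootstrap sample of rows and a random subset of features."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    n_features = n_features or max(1, int(np.sqrt(d)))
    forest = []
    for _ in range(B):
        rows = rng.integers(0, n, size=n)                       # data bagging: sample n rows with replacement
        feats = rng.choice(d, size=n_features, replace=False)   # feature bagging: random feature subset
        tree = DecisionTreeClassifier().fit(X[rows][:, feats], y[rows])
        forest.append((tree, feats))
    return forest

def predict_forest(forest, X):
    """Average the B trees (majority vote for 0/1 labels)."""
    votes = np.mean([tree.predict(X[:, feats]) for tree, feats in forest], axis=0)
    return (votes > 0.5).astype(int)
```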

Boosting

We may have lots of weak learners, and we want to combine them into a strong learner.

That is, we want to learn a weighted combination of the $h_i$.

Adaboost

Key idea: we can set a weight on each sample to control what the next weak learner focuses on.

Construction:

Init: $D_1(i) = \frac{1}{m}$ for all $i$.

Given $D_t$ and $h_t$:

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}e^{-\alpha_t},\quad\text{if } y_i = h_t(x_i)\qquad\text{(answered correctly: decrease the weight)}$$

$$D_{t+1}(i) = \frac{D_t(i)}{Z_t}e^{\alpha_t},\quad\text{if } y_i\neq h_t(x_i)\qquad\text{(answered incorrectly: increase the weight)}$$

$Z_t = \sum_i D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}$ is the normalization factor.

$$\alpha_t = \frac{1}{2}\ln\Big(\frac{1-\epsilon_t}{\epsilon_t}\Big) \ge 0$$

$\epsilon_t$ is the weighted error rate of $h_t$.

We will explain later why $\alpha_t$ (the weight of $h_t$) is set this way; the smaller the error rate, the larger the weight, which is quite reasonable.

$$H_{\text{final}}(x) = \mathrm{sign}\Big(\sum_t\alpha_t h_t(x)\Big)$$
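
A minimal sketch of this construction, using depth-1 trees (stumps) from scikit-learn as the weak learners and passing the sample weights $D_t$ through sample_weight; the choice of weak learner is my own assumption, not part of the notes:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost for labels y in {-1, +1}: reweight samples, weight each weak learner by alpha_t."""
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    learners, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error epsilon_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)    # smaller weight if correct, larger if wrong
        D = D / D.sum()                      # divide by the normalization factor Z_t
        learners.append(h)
        alphas.append(alpha)

    def H(X_new):                            # H_final(x) = sign(sum_t alpha_t h_t(x))
        scores = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(scores)
    return H
```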

Convergence Thm

$$\gamma_t = \frac{1}{2} - \epsilon_t,\qquad \gamma = \min_t\gamma_t,\qquad \mathrm{Error}(H_{\text{final}}) \le e^{-2\gamma^2 T}$$

AdaBoost is adaptive: it does not need to know $\gamma$ or $T$ as prior knowledge.

Convergence Proof

Step 1: unwrap the recurrence.

$$D_{T+1}(i) = \frac{1}{m\prod_t Z_t}\,e^{-y_i\sum_t\alpha_t h_t(x_i)}$$

Letting $f(x_i) = \sum_t\alpha_t h_t(x_i)$, this becomes:

$$D_{T+1}(i) = \frac{1}{m\prod_t Z_t}\,e^{-y_i f(x_i)}$$

Step 2: bound the error of $H$.

$$\begin{aligned}\mathrm{Error}(H) &= \frac{1}{m}\sum_i\mathbb{1}[y_i\neq H(x_i)]\\ &= \frac{1}{m}\sum_i\mathbb{1}[y_i f(x_i)\le 0]\qquad (H(x) = \mathrm{sign}(f(x)))\\ &\le \frac{1}{m}\sum_i e^{-y_i f(x_i)}\qquad\text{(a soft version of the indicator function)}\\ &= \Big(\prod_t Z_t\Big)\Big(\sum_i D_{T+1}(i)\Big) = \prod_t Z_t\qquad\text{(each $D_t$ is normalized, so its sum is 1)}\end{aligned}$$

Step 3: bound $Z_t$.

$$Z_t = \sum_i D_t(i)\,e^{-\alpha_t y_i h_t(x_i)} = \sum_{i:\,y_i\neq h_t(x_i)}D_t(i)\,e^{\alpha_t} + \sum_{i:\,y_i = h_t(x_i)}D_t(i)\,e^{-\alpha_t} = \epsilon_t e^{\alpha_t} + (1-\epsilon_t)e^{-\alpha_t}\qquad\text{(definition of $\epsilon_t$)}$$

Taking the derivative with respect to $\alpha$ and minimizing gives $\alpha_t = \frac{1}{2}\ln\big(\frac{1-\epsilon_t}{\epsilon_t}\big)$, and hence $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$.

Since $\epsilon_t = \frac{1}{2} - \gamma_t$:

$$Z_t = 2\sqrt{\Big(\frac{1}{2}-\gamma_t\Big)\Big(\frac{1}{2}+\gamma_t\Big)} = \sqrt{1-4\gamma_t^2}$$

$$\prod_t Z_t = \prod_t\sqrt{1-4\gamma_t^2} \le e^{-2\sum_t\gamma_t^2}\qquad\text{(using $1 - x \le e^{-x}$)}$$

From this theorem we know that as $T\to+\infty$, $\mathrm{Error}(H)\to 0$. But this only says the training error decreases; combining too many weak learners can overfit. That leads to the next part: margin-based analysis.

Margin-based analysis

Thm 1: Let $S$ be a sample of $m$ examples chosen independently at random according to $D$. Assume that the base hypothesis space $H$ is finite, and let $\delta > 0$. Then with probability at least $1-\delta$ over the random choice of the training set $S$, every weighted average function $f\in C$ satisfies the following bound for all $\theta > 0$:

$$P_D[yf(x)\le 0] \le P_S[yf(x)\le\theta] + O\left(\frac{1}{\sqrt{m}}\left(\frac{\log m\,\log|H|}{\theta^2} + \log\frac{1}{\delta}\right)^{1/2}\right)$$

Define $C_N$ to be the set of unweighted averages over $N$ elements from $H$:

$$C_N = \left\{g : x\mapsto\frac{1}{N}\sum_{i=1}^N h_i(x)\ \middle|\ h_i\in H\right\}$$

We allow the same $h\in H$ to appear multiple times in the sum. This set will play the role of the approximating set in the proof.

Any majority-vote hypothesis $f\in C$ can be associated with a distribution over $H$ defined by its coefficients $a_h$. By choosing $N$ elements of $H$ independently at random according to this distribution we can generate an element of $C_N$. Using this construction, we map each $f\in C$ to a distribution $Q$ over $C_N$.

The idea here is quite elegant: for each function $f$, we construct a probability distribution over a set of functions whose average recovers $f$.

How do we construct this distribution? If $f = \sum_i c_i h_i$, then $h_i$ is sampled with probability $c_i$. Sampling $N$ functions this way induces a distribution over $C_N$, and crucially $C_N$ is a finite set.

Our goal is to upper bound the generalization error of $f\in C$. For a fixed $g\in C_N$ and $\theta > 0$ we can separate this probability into two terms:

$$P_{x\sim D}[yf(x)\le 0] \le P_{x\sim D}\Big[yg(x)\le\frac{\theta}{2},\ yf(x)\le 0\Big] + P_{x\sim D}\Big[yg(x)>\frac{\theta}{2},\ yf(x)\le 0\Big] \le P_{x\sim D}\Big[yg(x)\le\frac{\theta}{2}\Big] + P_{x\sim D}\Big[yg(x)>\frac{\theta}{2}\ \Big|\ yf(x)\le 0\Big]$$

Since this inequality holds for any $g\in C_N$, we can take the expectation over $g\sim Q$:

$$P_D[yf(x)\le 0] \le P_{x\sim D,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] + P_{x\sim D,\,g\sim Q}\Big[yg(x)>\frac{\theta}{2}\ \Big|\ yf(x)\le 0\Big]$$

We first look at the second term:

Since $f(x) = \mathbb{E}_{g\sim Q}[g(x)]$, for a fixed sample $x\sim D$ the Chernoff–Hoeffding bound gives:

$$P_{g\sim Q}\Big[yg(x)>\frac{\theta}{2}\ \Big|\ yf(x)\le 0\Big] \le P_{g\sim Q}\Big[yg(x) - yf(x)>\frac{\theta}{2}\Big] = P_{g\sim Q}\Big[yg(x) - \mathbb{E}[yg(x)]>\frac{\theta}{2}\Big] \le e^{-N\theta^2/8}\qquad\text{($g$ is an average of $N$ terms)}$$

Then we come to the first term.

Restating slightly: with probability $1-\delta_N$, we want

$$P_{x\sim D,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] \le P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] + \epsilon_N$$

What we actually want to control is the probability that the empirical quantity $P_{x\sim S,\,g\sim Q}[yg(x)\le\frac{\theta}{2}]$ falls far below the population quantity $P_{x\sim D,\,g\sim Q}[yg(x)\le\frac{\theta}{2}]$. This is somewhat like deriving a sample complexity bound under the realizable assumption.

To upper bound the first term, we use the union bound.

For any single choice of $g$ and $\theta$, suppose that

$$P_{x\sim S,\,g\sim Q}\Big[yg(x)>\frac{\theta}{2}\Big] > P_{x\sim D,\,g\sim Q}\Big[yg(x)>\frac{\theta}{2}\Big] + \epsilon_N$$

How do we bound the probability that this happens? Chernoff–Hoeffding:

$$P\left[\Big|\sum_{x\in S}\mathbb{1}\Big[yg(x)>\frac{\theta}{2}\Big] - m\,\mathbb{E}\Big| > m\,\epsilon_N\right] \le e^{-2m\epsilon_N^2}$$

In the expression above, $\mathbb{E}$ denotes $P_{x\sim D,\,g\sim Q}\big[yg(x)>\frac{\theta}{2}\big]$.
Then, taking a union bound over all $|C_N|\le|H|^N$ functions and over all choices of $\theta$ ($\theta$ can only take the values $2i/N$ for $i = 0, 1, \ldots, N$, by the construction of $C_N$), this failure probability becomes $(N+1)\,|C_N|\,e^{-2m\epsilon_N^2}$.

Next, rewrite the event inside $P[\cdot]$ equivalently:

$$P_{x\sim S,\,g\sim Q}\Big[yg(x)>\frac{\theta}{2}\Big] > P_{x\sim D,\,g\sim Q}\Big[yg(x)>\frac{\theta}{2}\Big] + \epsilon_N \iff 1 - P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] > 1 - P_{x\sim D,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] + \epsilon_N \iff P_{x\sim D,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] > P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] + \epsilon_N$$

We want the probability of this last event to be at most $\delta_N$. So:

$$\delta_N \ge (N+1)\,|C_N|\,e^{-2m\epsilon_N^2} \iff \ln\frac{\delta_N}{(N+1)|C_N|} \ge -2m\epsilon_N^2 \iff \epsilon_N \ge \sqrt{\frac{1}{2m}\ln\frac{(N+1)|C_N|}{\delta_N}}$$

To guarantee this last inequality, since $|C_N|\le|H|^N$, it suffices to take

$$\epsilon_N \ge \sqrt{\frac{1}{2m}\ln\frac{(N+1)|H|^N}{\delta_N}}\qquad\text{(this is our choice of $\epsilon_N$)}$$

Analogously to the earlier inequality $P_D[yf(x)\le 0] \le P_{x\sim D,\,g\sim Q}[yg(x)\le\frac{\theta}{2}] + P_{x\sim D,\,g\sim Q}[yg(x)>\frac{\theta}{2}\mid yf(x)\le 0]$,

we have a corresponding inequality on $S$:

$$P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] \le P_{x\sim S}\big[yf(x)\le\theta\big] + P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\ \Big|\ yf(x)>\theta\Big]$$

By the Chernoff–Hoeffding bound:

$$P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\ \Big|\ yf(x)>\theta\Big] \le e^{-N\theta^2/8}$$

Combining these, we get for every $\theta > 0$ and $N\ge 1$:

$$\begin{aligned}P_D[yf(x)\le 0] &\le P_{x\sim D,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] + e^{-N\theta^2/8}\\ &\le P_{x\sim S,\,g\sim Q}\Big[yg(x)\le\frac{\theta}{2}\Big] + \epsilon_N + e^{-N\theta^2/8}\qquad\text{(with probability $1-\delta_N$)}\\ &\le P_{x\sim S}\big[yf(x)\le\theta\big] + e^{-N\theta^2/8} + \epsilon_N + e^{-N\theta^2/8}\end{aligned}$$

Substitute in $\epsilon_N$. Since the bound should hold for every $N$, the failure probability is bounded by summing over all values of $N$: take $\sum_N\delta_N = \delta$ with $\delta_N = \frac{\delta}{N(N+1)}$.

Finally, choose $N$ appropriately as a function of $\theta$. This completes the proof (quite a long theorem).

Modification on Adaboost

Thm2:

Suppose the base learning algorithm, when called by AdaBoost, generates hypotheses with weighted training errors $\epsilon_1, \ldots, \epsilon_T$. Then for any $\theta$, we have:

$$P_{x\sim S}[yf(x)\le\theta] \le 2^T\prod_{t=1}^T\sqrt{\epsilon_t^{1-\theta}(1-\epsilon_t)^{1+\theta}}\qquad\Big(\text{compare with }\mathrm{Error}(H)\le\prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)}\Big)$$

Proof:

Note that if $yf(x)\le\theta$, then (here $f(x) = \frac{\sum_t\alpha_t h_t(x)}{\sum_t\alpha_t}$ is the normalized combination)

$$y_i\sum_t\alpha_t h_t(x_i) \le \theta\sum_t\alpha_t$$

So:

$$\begin{aligned}P_{x\sim S}[yf(x)\le\theta] &= \mathbb{E}_{x\sim S}\Big[\mathbb{1}\Big[\theta\sum_t\alpha_t - y_i\sum_t\alpha_t h_t(x_i)\ge 0\Big]\Big]\\ &\le \mathbb{E}_{x\sim S}\Big[e^{\theta\sum_t\alpha_t - y_i\sum_t\alpha_t h_t(x_i)}\Big]\qquad\text{(soft version of the indicator function)}\\ &= e^{\theta\sum_t\alpha_t}\,\frac{1}{m}\sum_i e^{-y_i\sum_t\alpha_t h_t(x_i)}\\ &= e^{\theta\sum_t\alpha_t}\prod_t Z_t\sum_i D_{T+1}(i) = e^{\theta\sum_t\alpha_t}\prod_t Z_t\qquad\text{(same as before, with an extra factor because the threshold is $\theta$ rather than 0)}\end{aligned}$$

Since $\alpha_t = \frac{1}{2}\ln\big(\frac{1-\epsilon_t}{\epsilon_t}\big)$, we have:

$$e^{\alpha_t} = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}},\qquad e^{\theta\sum_t\alpha_t} = \prod_t\sqrt{\epsilon_t^{-\theta}(1-\epsilon_t)^{\theta}},\qquad e^{\theta\sum_t\alpha_t}\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t^{1-\theta}(1-\epsilon_t)^{1+\theta}}$$

This also decreases exponentially as $T$ grows.

Gradient Boosting

One drawback of AdaBoost: it only handles binary classification tasks.

What about regression tasks? We want to extend AdaBoost to regression, and this gives gradient boosting.

Recall Adaboost is optimizing the following function:

  • $\prod_t Z_t = \frac{1}{m}\sum_i\exp(-y_i f(x_i)) = \frac{1}{m}\sum_i\exp\big(-y_i\sum_t\alpha_t h_t(x_i)\big)$
  • This goes to 0 as $T$ grows.
  • How does it optimize this? It uses coordinate descent.

Our goal: given a set of weak learners, learn a weight for each of them.

Computing the full gradient for gradient descent is too expensive, because the number of weak learners can be very large.

So each time we fix all the other $\alpha$'s and learn just one of them.

process:

  • Initially all $\alpha_i = 0$.
  • In each iteration of AdaBoost, we pick a coordinate $\alpha_i$ and set it so as to decrease the loss the most.
  • It is as if each time we add one weak learner $h$ into $f$.

Since each round adds one weak learner into $f$, the newly added function can be fit on the residual pairs $(x_i,\ y_i - f(x_i))$.

How is it related to gradient descent?

Suppose the loss we want to optimize is the $\ell_2$ loss:

$$L = \sum_i\frac{(y_i - f(x_i))^2}{2},\qquad\frac{\partial L}{\partial f(x_i)} = f(x_i) - y_i$$

If we did gradient descent, $f_{t+1}(x_i) = f_t(x_i) - \eta\frac{\partial L}{\partial f(x_i)} = f_t(x_i) - \eta\big(f_t(x_i) - y_i\big)$, so each update moves the prediction closer to $y$.

In practice, however, each step does not apply the update $-\eta\frac{\partial L}{\partial f(x_i)}$ directly; instead we add a term $\alpha_i h_i$, so the effect only roughly approximates gradient descent.

More generally, other losses are handled in the same way.
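
A minimal sketch of gradient boosting for the $\ell_2$ loss: each round fits a small regression tree to the residuals $y_i - f(x_i)$ (the negative gradient) and adds it with a step size $\eta$. The use of scikit-learn regression trees and the parameter values are my own assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, T=100, eta=0.1, max_depth=3):
    """Gradient boosting with squared loss: repeatedly add eta * h, with h fit on y - f(X)."""
    base = np.mean(y)                        # start from a constant prediction
    f = np.full(len(y), base)
    trees = []
    for _ in range(T):
        residual = y - f                     # negative gradient of sum_i (y_i - f(x_i))^2 / 2
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f = f + eta * h.predict(X)
        trees.append(h)

    def predict(X_new):
        return base + eta * sum(h.predict(X_new) for h in trees)
    return predict
```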