Theorem statement
For a binary classification problem, when the hypothesis space is a finite set of functions \(\mathcal{F}=\{f_1,f_2,\cdots,f_d\}\), then for any function \(f\in\mathcal{F}\), with probability at least \(1-\delta\) the following inequality holds:
\(R(f)\leq\hat{R}(f)+\epsilon(d,N,\delta)\)
where
\(\epsilon(d,N,\delta)=\sqrt{\frac{1}{2N}(\log d+\log\frac{1}{\delta})}\)
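As a quick numerical illustration (a sketch, not part of the proof), the bound \(\epsilon(d,N,\delta)\) can be computed directly; note that it shrinks as \(N\) grows and grows only logarithmically in \(d\):

```python
import math

def epsilon(d: int, N: int, delta: float) -> float:
    """Generalization-gap bound for a finite hypothesis class of size d,
    N samples, and confidence level 1 - delta."""
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))

# The bound shrinks as the sample size N grows ...
print(epsilon(d=10, N=1000, delta=0.05))
# ... and grows only logarithmically in the class size d.
print(epsilon(d=10000, N=1000, delta=0.05))
```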
Proving this bound requires Hoeffding's inequality.
Hoeffding's inequality
Let \(X_1,X_2,\cdots,X_n\) be independent random variables with \(P(X_i\in[a_i,b_i])=1\) for \(1\leq i\leq n\), and let \(S_n=\sum_{i=1}^{n}X_i\). Then for any \(t>0\), the following inequalities hold:
\[\begin{align}
P(S_n-E[S_n]\geq t)\leq &\exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2})\\
P(E[S_n]-S_n\geq t)\leq &\exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2})
\end{align}
\]
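The inequality can be sanity-checked by simulation. The sketch below (assuming i.i.d. fair Bernoulli variables on \([0,1]\), so \(\sum_i(b_i-a_i)^2=n\), with hypothetical parameters \(n=100\), \(t=10\)) estimates \(P(S_n-E[S_n]\geq t)\) and compares it with the bound:

```python
import math
import random

random.seed(0)

n, t, trials = 100, 10.0, 5000
hits = 0
for _ in range(trials):
    # S_n is a sum of n fair Bernoulli variables, so E[S_n] = n / 2.
    s = sum(random.randint(0, 1) for _ in range(n))
    if s - n / 2 >= t:
        hits += 1

empirical = hits / trials
# For X_i in [0, 1], sum_i (b_i - a_i)^2 = n.
bound = math.exp(-2 * t**2 / n)
# With these parameters the empirical tail frequency should stay below the bound.
print(empirical, bound)
```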
Proof of Hoeffding's inequality
We first prove a lemma used in the proof of Hoeffding's inequality:
Lemma
For a random variable \(X\) with \(P(X\in [a,b])=1\) and \(E(X)=0\), for any \(s>0\):
\[\begin{align}
E(e^{sX})\leq e^{\frac{1}{8}s^{2}(b-a)^{2}}
\end{align}
\]
Proof:
- First, if \(a=b=0\), then \(P(X=0)=1\), so:
\(E(e^{sX})=e^{s\cdot 0}=1\leq e^{\frac{1}{8}s^2(0-0)^2}=1\)
- If \(a=b\neq0\), then \(E(X)=\int_a^a x\cdot p(x)dx=a\neq0\), contradicting \(E(X)=0\), so this case cannot occur.
- If \(a=0\), then since \(E(X)=\int_{0}^{b}x\cdot p(x)dx=0\cdot P(X=0)+\int_{0^+}^{b}x\cdot p(x)dx\),
and the second term \(\int_{0^+}^{b}x\cdot p(x)dx\) has \(x>0\) and \(p(x)\geq 0\), we must have \(\int_{0^+}^{b}x\cdot p(x)dx\geq 0\).
Since \(E(X)=0\) by assumption, it follows that \(P(X=0)=1\) and \(P(X\neq 0)=0\), so:
\[\begin{align}
E(e^{sX})&=\int_{0}^{b}p(x)\cdot e^{sx}dx\\
&=p(0)\cdot e^{s\cdot 0}+\int_{0^+}^{b}p(x)\cdot e^{sx}dx\\
&=1\cdot e^{0}+0=1\\
&\leq e^{\frac{1}{8}s^2b^2}=e^{\frac{1}{8}s^2(b-a)^2}
\end{align}
\]
- In the remaining case (\(a<b\), \(a\neq 0\)), \(E(X)=\int_a^b x\cdot p(x)dx=0\) forces \(a<0\) and \(b\geq 0\), since otherwise \(X\) would be almost surely strictly positive or strictly negative and could not have zero mean.
Since \(e^{sX}\) is a convex function of \(X\), and \(sX=\frac{b-X}{b-a}sa+\frac{X-a}{b-a}sb\), Jensen's inequality gives:
\[\begin{align}
e^{sX}=e^{\frac{b-X}{b-a}sa+\frac{X-a}{b-a}sb}\leq \frac{b-X}{b-a}e^{sa}+\frac{X-a}{b-a}e^{sb}
\end{align}
\]
Taking expectations of both sides over \(X\) and substituting \(E(X)=0\):
\[\begin{align}
E(e^{sX})&\leq \frac{b-E(X)}{b-a}e^{sa}+\frac{E(X)-a}{b-a}e^{sb}\\
&=\frac{b}{b-a}e^{sa}-\frac{a}{b-a}e^{sb}\\
&=(-\frac{a}{b-a})e^{sa}(e^{sb-sa}-\frac{b}{a})
\end{align}
\]
Let \(\theta=-\frac{a}{b-a}>0\); the right-hand side becomes:
\[\begin{align}
\theta e^{-s\theta(b-a)}(\frac{1}{\theta}-1+e^{s(b-a)})&=(1-\theta+\theta e^{s(b-a)})e^{-s\theta (b-a)}\\
&=e^{\log[1-\theta+\theta e^{s(b-a)}]}\cdot e^{-s\theta(b-a)}\\
&=e^{-s\theta(b-a)+\log[1-\theta+\theta e^{s(b-a)}]}\\
\end{align}
\]
Let \(u=s(b-a)\) and define \(\varphi\):
\[\begin{align}
\left\{ \begin{array}{l} \varphi : R\rightarrow R\\ \varphi(u)=-\theta u+\log(1-\theta+\theta e^{u}) \end{array} \right.
\end{align}
\]
Since \(e^{u}>0\), \(a<0\), \(b\geq0\), and \(\theta>0\), we have \(1-\theta+\theta e^{u}=\theta(\frac{1}{\theta}-1+e^{u})=\theta(-\frac{b}{a}+e^{u})>0\), so \(\varphi\) is well defined.
Substituting \(\varphi\) into the bound on \(E(e^{sX})\):
\[\begin{align}
E(e^{sX})\leq e^{\varphi(u)}
\end{align}
\]
Expanding \(\varphi\) by Taylor's theorem with the Lagrange remainder, there exists \(v\in[0,u]\) such that:
\[\begin{align}
\varphi(u)=\varphi(0)+u\varphi^{\prime}(0)+\frac{u^{2}}{2!}\varphi^{\prime\prime}(v)
\end{align}
\]
Computing each term:
\[\begin{align}
\varphi(0)&=0\\
\varphi^{\prime}(0)&=-\theta+\frac{\theta e^{u}}{1-\theta+\theta e^{u}}|_{u=0}=0\\
\varphi^{\prime\prime}(v)&=\frac{\theta e^{u}(1-\theta+\theta e^{u})-\theta e^{u}\theta e^{u}}{(1-\theta+\theta e^{u})^{2}}|_{u=v}\\
&=\frac{(1-\theta)\theta e^{v}}{(1-\theta+\theta e^{v})^{2}}\\
&=\frac{1-\theta}{1-\theta+\theta e^{v}}\cdot\frac{\theta e^{v}}{1-\theta+\theta e^{v}}\\
&=\rho(1-\rho)\leq\frac{1}{4}
\end{align}
\]
where \(\rho=\frac{1-\theta}{1-\theta+\theta e^{v}}\) (written \(\rho\) rather than \(t\) to avoid clashing with the \(t\) in Hoeffding's inequality), and \(\rho(1-\rho)\leq\frac{1}{4}\) for any real \(\rho\).
Therefore:
\[\begin{align}
\varphi(u)\leq0+0+\frac{1}{2}u^{2}\cdot\frac{1}{4}=\frac{1}{8}u^{2}=\frac{1}{8}s^{2}(b-a)^{2}
\end{align}
\]
This proves the lemma.
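As a numerical sanity check of the lemma (a sketch, not part of the proof): a two-point distribution placing mass \(\theta=-\frac{a}{b-a}\) at \(b\) and \(1-\theta\) at \(a\) has mean zero, and its moment generating function indeed stays below \(e^{\frac{1}{8}s^2(b-a)^2}\):

```python
import math

def mgf_two_point(a: float, b: float, s: float) -> float:
    """E[e^{sX}] for the zero-mean two-point distribution on {a, b}:
    P(X = b) = theta, P(X = a) = 1 - theta, with theta = -a / (b - a)."""
    theta = -a / (b - a)
    return (1 - theta) * math.exp(s * a) + theta * math.exp(s * b)

def lemma_bound(a: float, b: float, s: float) -> float:
    return math.exp(s**2 * (b - a) ** 2 / 8)

# Check the lemma's bound on a few arbitrary intervals and values of s.
for a, b in [(-1.0, 1.0), (-2.0, 0.5), (-0.1, 3.0)]:
    for s in [0.1, 0.5, 1.0, 2.0]:
        assert mgf_two_point(a, b, s) <= lemma_bound(a, b, s)
```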
Markov's inequality
The proof also relies on Markov's inequality, a standard result from any probability and statistics course that should pose no difficulty to most readers; the statement and a proof are reproduced below.
Let \(X\) be a non-negative random variable whose expectation \(E(X)\) exists. Then for any \(t>0\):
\[\begin{align}
P(X\geq t)\leq\frac{E(X)}{t}
\end{align}
\]
Proof:
Assume \(X\in[a,b]\) with \(a\geq 0\). It is easy to see that:
\[\begin{align}
a=a\cdot\int_a^b p(x)dx\leq E(X)=\int_a^b x\cdot p(x)dx\leq b\cdot\int_a^b p(x)dx=b\\
\end{align}
\]
that is, \(a\leq E(X)\leq b\).
If \(t\leq a\), then \(P(X\geq t)=1=\frac{t}{t}\leq\frac{a}{t}\leq\frac{E(X)}{t}\).
If \(t\geq b\), then \(P(X\geq t)=0=\frac{0}{t}\leq\frac{a}{t}\leq\frac{E(X)}{t}\).
If \(t\in(a,b)\), then:
\[\begin{align}
E(X)&=\int_a^b x\cdot p(x)dx\\
&=\int_a^t x\cdot p(x)dx + \int_t^b x\cdot p(x)dx\\
&\geq \int_t^b x\cdot p(x)dx\\
&\geq t\cdot \int_t^b p(x)dx\\
&=t\cdot P(X\geq t)
\end{align}
\]
This proves Markov's inequality.
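A minimal numerical check of Markov's inequality (a sketch using an arbitrary discrete distribution):

```python
# X takes the values 0..9 uniformly; E(X) = 4.5.
values = list(range(10))
mean = sum(values) / len(values)

for t in [1, 2, 5, 8]:
    tail = sum(1 for x in values if x >= t) / len(values)  # P(X >= t)
    assert tail <= mean / t  # Markov: P(X >= t) <= E(X) / t
```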
We can now prove Hoeffding's inequality.
Let \(X_1,X_2,\cdots,X_n\) be \(n\) independent random variables with \(P(X_i\in[a_i,b_i])=1\) for \(1\leq i\leq n\), and let \(S_n=\sum_{i=1}^{n}X_i\). For any \(s>0\), applying Markov's inequality to the non-negative variable \(e^{s(S_n-E[S_n])}\):
\[\begin{align}
P(S_n-E[S_n]\geq t)&=P(e^{s(S_n-E[S_n])}\geq e^{st})\\
&\leq e^{-st}E[e^{s(S_n-E[S_n])}]\\
&=e^{-st}E[e^{s(\sum_{i=1}^{n}X_i-E[\sum_{i=1}^{n}X_i])}]\\
&=e^{-st}E[e^{s(\sum_{i=1}^{n}(X_i-E(X_i)))}]\\
&=e^{-st}E[\prod_{i=1}^{n}e^{s(X_i-E(X_i))}]\\
&=e^{-st}\prod_{i=1}^nE[e^{s(X_i-E(X_i))}]\\
\end{align}
\]
Let \(Y_i=X_i-E(X_i)\); then:
\[\begin{align}
E(Y_i)=E(X_i-E(X_i))&=\int_{a_i}^{b_i} p(x_i)\cdot (x_i-E(X_i))dx_i\\
&=\int_{a_i}^{b_i}p(x_i)\cdot x_i dx_i - \int_{a_i}^{b_i}p(x_i)\cdot E(X_i)dx_i\\
&=E(X_i)-E(X_i)\int_{a_i}^{b_i}p(x_i)dx_i\\
&=E(X_i)-E(X_i)=0
\end{align}
\]
Each \(Y_i\) has mean zero and takes values in \([a_i-E(X_i),b_i-E(X_i)]\), an interval of width \(b_i-a_i\), so the lemma applies and the bound above becomes:
\[\begin{align}
P(S_n-E[S_n]\geq t)&\leq e^{-st}\prod_{i=1}^nE[e^{s(Y_i)}]\\
&\leq e^{-st}\prod_{i=1}^{n}e^{\frac{1}{8}s^2(b_i-a_i)^2}\\
&=\exp(-st+\frac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2)\\
\end{align}
\]
The derivation above assumed \(s>0\). Define:
\[\begin{align}
\left\{ \begin{array}{l} g:R_+\rightarrow R \\ g(s)=-st+\frac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2 \end{array} \right.
\end{align}
\]
\(g(s)\) is the familiar upward-opening parabola. Since the bound holds for every \(s>0\), we are free to choose the \(s\) that minimizes \(g(s)\). Solving \(g^\prime(s)=0\) gives \(s=\frac{4t}{\sum_{i=1}^n(b_i-a_i)^2}\); substituting back into the inequality yields:
\[\begin{align}
P(S_n-E[S_n]\geq t)\leq \exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2})
\end{align}
\]
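The choice of \(s\) can be verified numerically: with \(C=\sum_{i=1}^n(b_i-a_i)^2\), the exponent \(g(s)=-st+\frac{1}{8}s^2C\) is minimized at \(s^*=\frac{4t}{C}\) with \(g(s^*)=-\frac{2t^2}{C}\), which a coarse grid search confirms (a sketch with arbitrary values of \(t\) and \(C\)):

```python
# Hypothetical values for t and C = sum_i (b_i - a_i)^2.
t, C = 3.0, 5.0

def g(s: float) -> float:
    return -s * t + s**2 * C / 8

s_star = 4 * t / C  # stationary point of g
# The closed-form minimum value is -2 t^2 / C.
assert abs(g(s_star) - (-2 * t**2 / C)) < 1e-12

# g(s*) is no larger than g at any point of a coarse grid.
grid = [0.01 * k for k in range(1, 1000)]
assert all(g(s_star) <= g(s) + 1e-12 for s in grid)
```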
Applying the same argument to the variables \(-X_i\), which replaces \(S_n-E[S_n]\) with \(E[S_n]-S_n\), gives:
\[\begin{align}
P(E[S_n]-S_n\geq t)\leq \exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2})
\end{align}
\]
This proves Hoeffding's inequality.
Proof of the generalization error bound
For any function \(f\in\mathcal{F}\), \(\hat{R}(f)\) is the sample mean of \(N\) independent copies of the random variable \(L(Y,f(X))\), and \(R(f)\) is the expectation of \(L(Y,f(X))\). If the loss function takes values in the interval \([0,1]\), i.e. \([a_i,b_i]=[0,1]\) for all \(i\), then by Hoeffding's inequality, for any \(\epsilon>0\):
\[\begin{align}
P(R(f)-\hat{R}(f)\geq\epsilon)&=P(E(L(Y,f(X)))-\frac{1}{N}\sum_{i=1}^{N}L(Y_i,f(X_i))\geq\epsilon)\\
&=P(N\cdot E(L(Y,f(X)))-\sum_{i=1}^{N}L(Y_i,f(X_i))\geq N\cdot\epsilon)\\
&=P(N\cdot E(L(Y,f(X)))-S_n\geq N\cdot\epsilon)\\
\end{align}
\]
where \(S_n=\sum_{i=1}^{N}L(Y_i,f(X_i))\), and, since the \((X_i,Y_i)\) are i.i.d. copies of \((X,Y)\):
\[\begin{align}
E(S_n)&=E[\sum_{i=1}^{N}L(Y_i,f(X_i))]\\
&=\sum_{i=1}^{N}E[L(Y_i,f(X_i))]\\
&=\sum_{i=1}^{N}E[L(Y,f(X))]\\
&=N\cdot E(L(Y,f(X)))\\
\end{align}
\]
So the inequality becomes:
\[\begin{align}
P(R(f)-\hat{R}(f)\geq\epsilon)&=P(N\cdot E(L(Y,f(X)))-S_n\geq N\cdot\epsilon)\\
&=P(E(S_n)-S_n\geq N\cdot\epsilon)\\
&\leq\exp(\frac{-2(N\cdot\epsilon)^2}{\sum_{i=1}^{N}(b_i-a_i)^2})\\
&=\exp(\frac{-2(N\cdot\epsilon)^2}{N\cdot 1})=\exp(-2N\epsilon^2)\\
\end{align}
\]
The derivation above applies to an arbitrary \(f\in\mathcal{F}\), i.e. it holds for every \(f\in\mathcal{F}\). For \(\mathcal{F}=\{f_1,f_2,\cdots,f_d\}\), the event that some \(f\) satisfies \(R(f)-\hat{R}(f)\geq\epsilon\) is the union of the \(d\) individual events, so by the union bound:
\[\begin{align}
P(\exists f\in\mathcal{F}:R(f)-\hat{R}(f)\geq\epsilon)&=P(\bigcup_{f\in\mathcal{F}}\{ R(f)-\hat{R}(f)\geq\epsilon \})\\
&\leq \sum_{f\in\mathcal{F}}P(R(f)-\hat{R}(f)\geq\epsilon)\\
&\leq d\cdot \exp(-2N\epsilon^2)
\end{align}
\]
Taking the complement, this is equivalent to:
\[\begin{align}
P(\forall f\in\mathcal{F}:R(f)-\hat{R}(f)<\epsilon)\geq1-d\exp(-2N\epsilon^2)
\end{align}
\]
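This uniform bound can be illustrated by simulation. The sketch below (with hypothetical parameters: \(d=10\) classifiers whose per-sample losses are assumed i.i.d. Bernoulli(0.5), \(N=100\) samples, \(\epsilon=0.2\)) estimates how often some classifier's true risk exceeds its empirical risk by at least \(\epsilon\), and compares that frequency with \(d\cdot\exp(-2N\epsilon^2)\):

```python
import math
import random

random.seed(1)

d, N, eps, trials = 10, 100, 0.2, 2000
true_risk = 0.5  # each loss L(Y_i, f(X_i)) is assumed Bernoulli(0.5)

bad = 0
for _ in range(trials):
    # One trial: draw N losses for each of the d classifiers and check
    # whether any true risk exceeds its empirical risk by at least eps.
    if any(
        true_risk - sum(random.randint(0, 1) for _ in range(N)) / N >= eps
        for _ in range(d)
    ):
        bad += 1

empirical = bad / trials
bound = d * math.exp(-2 * N * eps**2)
# With these parameters the empirical frequency should stay below the bound.
print(empirical, bound)
```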
Setting \(\delta=d\exp(-2N\epsilon^2)\), for every \(f\in\mathcal{F}\):
\[\begin{align}
P(R(f)<\hat{R}(f)+\epsilon)\geq1-\delta
\end{align}
\]
That is, with probability at least \(1-\delta\), \(R(f)<\hat{R}(f)+\epsilon\), where solving \(\delta=d\exp(-2N\epsilon^2)\) for \(\epsilon\) gives:
\[\begin{align}
\epsilon=\sqrt{\frac{1}{2N}(\log d-\log\delta)}=\epsilon(d,N,\delta)
\end{align}
\]
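The algebra between \(\delta\) and \(\epsilon\) can be checked numerically: substituting \(\epsilon(d,N,\delta)\) back into \(\delta=d\exp(-2N\epsilon^2)\) recovers \(\delta\) (a sketch with arbitrary values of \(d\), \(N\), and \(\delta\)):

```python
import math

def epsilon(d: int, N: int, delta: float) -> float:
    return math.sqrt((math.log(d) - math.log(delta)) / (2 * N))

# Round trip: delta -> epsilon -> delta.
d, N, delta = 20, 500, 0.05
eps = epsilon(d, N, delta)
recovered = d * math.exp(-2 * N * eps**2)
assert abs(recovered - delta) < 1e-9
```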
This establishes the generalization error bound:
\[\begin{align}
R(f)\leq\hat{R}(f)+\epsilon(d,N,\delta)
\end{align}
\]
Summary
We have now given a complete proof of the generalization error bound for binary classification. The theorem says that, in this setting, the smaller the training error, the smaller the resulting upper bound on the generalization error; this gives a formal sense in which a model learned from data can be expected to predict well on unseen data.