A Detailed Proof of the Generalization Error Bound for Binary Classification

Statement of the Theorem

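For binary classification, suppose the hypothesis space is a finite set of functions \(\mathcal{F}=\{f_1,f_2,\cdots,f_d\}\). Let \(R(f)\) denote the generalization error (expected risk) of \(f\) and \(\hat{R}(f)\) its training error (empirical risk) on \(N\) samples. Then for any function \(f\in\mathcal{F}\), the following inequality holds with probability at least \(1-\delta\):
\(R(f)\leq\hat{R}(f)+\epsilon(d,N,\delta)\)
where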
\(\epsilon(d,N,\delta)=\sqrt{\frac{1}{2N}(\log d+\log\frac{1}{\delta})}\)
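To get a feel for the size of this term, here is a minimal Python sketch that evaluates \(\epsilon(d,N,\delta)\) using natural logarithms. The function name `generalization_gap` and the sample values of \(d\), \(N\) and \(\delta\) are illustrative assumptions, not part of the theorem.

```python
import math


def generalization_gap(d: int, n_samples: int, delta: float) -> float:
    """Evaluate epsilon(d, N, delta) = sqrt((log d + log(1/delta)) / (2N))."""
    if d < 1 or n_samples < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need d >= 1, N >= 1 and 0 < delta < 1")
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * n_samples))


if __name__ == "__main__":
    # Illustrative values only: 1000 hypotheses, confidence level 95%.
    for n in (100, 1_000, 10_000):
        print(n, generalization_gap(d=1000, n_samples=n, delta=0.05))
```

The term decreases like \(1/\sqrt{N}\) in the sample size and increases only logarithmically in \(d\).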
The proof relies on Hoeffding's inequality.

Hoeffding's Inequality

Let \(X_1,X_2,\cdots,X_n\) be independent random variables with \(P(X_i\in[a_i,b_i])=1,1\leq i\leq n\), and let \(S_n=\sum_{i=1}^{n}X_i\). Then for any \(t>0\), the following inequalities hold:

\[\begin{align} P(S_n-E[S_n]\geq t)\leq &\exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2})\\ P(E[S_n]-S_n\geq t)\leq &\exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}) \end{align} \]
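As a sanity check on the first inequality, the following Python sketch compares a Monte Carlo estimate of the upper tail with the Hoeffding bound. The choice of uniform variables, the function names and the number of trials are assumptions made for illustration; the inequality itself holds for any bounded independent variables.

```python
import math
import random


def hoeffding_bound(t, intervals):
    """Right-hand side exp(-2 t^2 / sum_i (b_i - a_i)^2)."""
    denom = sum((b - a) ** 2 for a, b in intervals)
    return math.exp(-2.0 * t * t / denom)


def empirical_upper_tail(t, intervals, trials=20_000, seed=0):
    """Estimate P(S_n - E[S_n] >= t) for independent X_i ~ Uniform[a_i, b_i]."""
    rng = random.Random(seed)
    mean_sum = sum((a + b) / 2.0 for a, b in intervals)  # E[S_n] for uniforms
    hits = 0
    for _ in range(trials):
        s = sum(rng.uniform(a, b) for a, b in intervals)
        if s - mean_sum >= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    intervals = [(0.0, 1.0)] * 50  # 50 variables, each in [0, 1]
    for t in (2.0, 4.0, 6.0):
        print(t, empirical_upper_tail(t, intervals), hoeffding_bound(t, intervals))
```

The estimated tail probability should stay below the bound for every \(t\); the bound is typically far from tight because it uses only the ranges \(b_i-a_i\), not the variances.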

Proof of Hoeffding's Inequality

We first prove a lemma used in the proof of Hoeffding's inequality:

Lemma

Let \(X\) be a random variable with \(P(X\in [a,b])=1,E(X)=0\). Then for any \(s>0\):

\[\begin{align} E(e^{sX})\leq e^{\frac{1}{8}s^{2}(b-a)^{2}} \end{align} \]

Proof:

  • First, if \(a=b=0\), the assumptions give \(P(X=0)=1\), so:
    \(E(e^{sX})=e^{s\cdot 0}=1\leq e^{\frac{1}{8}s^2(0-0)^2}=1\)
  • If \(a=b\neq0\), then \(P(X=a)=1\) and \(E(X)=a\neq0\), which contradicts the assumption \(E(X)=0\), so this case cannot occur.
  • If \(a=0<b\), then \(E(X)=\int_{0}^{b}x\cdot p(x)dx=0\cdot p(0)+\int_{0^+}^{b}x\cdot p(x)dx\), where \(p(0)\) stands for the probability mass at \(0\).
    In the second term \(\int_{0^+}^{b}x\cdot p(x)dx\) we have \(x>0,p(x)\geq 0\), so \(\int_{0^+}^{b}x\cdot p(x)dx\geq 0\).
    Since \(E(X)=0\) by assumption, this term must vanish, which forces \(P(X=0)=1,P(X\neq 0)=0\). Hence:

\[\begin{align} E(e^{sX})&=\int_{0}^{b}p(x)\cdot e^{sx}dx\\ &=p(0)\cdot e^{s\cdot 0}+\int_{0^+}^{b}p(x)\cdot e^{sx}dx\\ &=1\cdot e^{0}+0=1\\ &\leq e^{\frac{1}{8}s^2b^2}=e^{\frac{1}{8}s^2(b-a)^2} \end{align} \]

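  • In the remaining case \(a<b\) and \(a\neq0\). Since \(E(X)=\int_a^b x\cdot p(x)dx=0\), we must have \(a<0,b\geq 0\).
    Note that \(e^{sx}\) is a convex function of \(x\), and that \(X=\frac{b-X}{b-a}a+\frac{X-a}{b-a}b\) is a convex combination of \(a\) and \(b\). By Jensen's inequality (convexity applied to two points):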

\[\begin{align} e^{(\frac{b-X}{b-a}sa+\frac{X-a}{b-a}sb)}&=e^{sX}\\ &\leq \frac{b-X}{b-a}e^{sa}+\frac{X-a}{b-a}e^{sb} \end{align} \]

       Taking expectations on both sides and substituting \(E(X)=0\) gives:

\[\begin{align} E(e^{sX})&\leq \frac{b-E(X)}{b-a}e^{sa}+\frac{E(X)-a}{b-a}e^{sb}\\ &=\frac{b}{b-a}e^{sa}-\frac{a}{b-a}e^{sb}\\ &=(-\frac{a}{b-a})e^{sa}(e^{sb-sa}-\frac{b}{a}) \end{align} \]

       Let \(\theta=-\frac{a}{b-a}>0\), so that \(sa=-s\theta(b-a)\) and \(-\frac{b}{a}=\frac{1}{\theta}-1\). The right-hand side above becomes:

\[\begin{align} \theta e^{-s\theta(b-a)}(\frac{1}{\theta}-1+e^{s(b-a)})&=(1-\theta+\theta e^{s(b-a)})e^{-s\theta (b-a)}\\ &=e^{\log\{[1-\theta+\theta e^{s(b-a)}]e^{-s\theta(b-a)}\}}\\ &=e^{-s\theta(b-a)+\log[1-\theta+\theta e^{s(b-a)}]} \end{align} \]

       Let \(u=s(b-a)\) and define \(\varphi\) by:

\[\begin{align} \left\{ \begin{array}{l} \varphi : R\rightarrow R\\ \varphi(u)=-\theta u+\log(1-\theta+\theta e^{u}) \end{array} \right. \end{align} \]

       Since \(e^{u}>0,a<0,b\geq0,\theta>0\), we have \(1-\theta+\theta e^{u}=\theta(\frac{1}{\theta}-1+e^{u})=\theta(-\frac{b}{a}+e^{u})>0\), so the logarithm, and hence \(\varphi\), is well defined.
       Written in terms of \(\varphi\), the bound on \(E(e^{sX})\) reads:

\[\begin{align} E(e^{sX})\leq e^{\varphi(u)} \end{align} \]

       By Taylor's theorem with the Lagrange form of the remainder, there exists \(v\in[0,u]\) such that:

\[\begin{align} \varphi(u)=\varphi(0)+u\varphi^{\prime}(0)+\frac{u^{2}}{2!}\varphi^{\prime\prime}(v) \end{align} \]

       Direct computation gives:

\[\begin{align} \varphi(0)&=0\\ \varphi^{\prime}(0)&=-\theta+\frac{\theta e^{u}}{1-\theta+\theta e^{u}}|_{u=0}=0\\ \varphi^{\prime\prime}(v)&=\frac{\theta e^{u}(1-\theta+\theta e^{u})-\theta e^{u}\theta e^{u}}{(1-\theta+\theta e^{u})^{2}}|_{u=v}\\ &=\frac{(1-\theta)\theta e^{v}}{(1-\theta+\theta e^{v})^{2}}\\ &=\frac{1-\theta}{1-\theta+\theta e^{v}}\cdot\frac{\theta e^{v}}{1-\theta+\theta e^{v}}\\ &=t(1-t)\leq\frac{1}{4} \end{align} \]

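      where \(t=\frac{1-\theta}{1-\theta+\theta e^{v}}\), and \(t(1-t)\leq\frac{1}{4}\) holds for every real \(t\).
      Therefore:

\[\begin{align} \varphi(u)\leq0+0+\frac{1}{2}u^{2}\cdot\frac{1}{4}=\frac{1}{8}u^{2}=\frac{1}{8}s^{2}(b-a)^{2} \end{align} \]

      Combined with \(E(e^{sX})\leq e^{\varphi(u)}\), this proves the lemma.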
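The lemma can be checked numerically on the two-point distribution that puts all its mass on the endpoints \(a\) and \(b\) with mean zero; for this distribution \(E(e^{sX})\) equals the intermediate bound \(e^{\varphi(u)}\) from the proof. The sketch below is only an illustration: the function names and the grid of test values are assumptions, and it requires \(a<0<b\).

```python
import math


def two_point_mgf(s, a, b):
    """E[exp(sX)] for X in {a, b} with P(X=a) = b/(b-a), P(X=b) = -a/(b-a).

    These probabilities make E[X] = 0; requires a < 0 < b.
    """
    if not a < 0.0 < b:
        raise ValueError("need a < 0 < b")
    p_a = b / (b - a)
    p_b = -a / (b - a)
    return p_a * math.exp(s * a) + p_b * math.exp(s * b)


def lemma_bound(s, a, b):
    """Right-hand side of the lemma: exp(s^2 (b - a)^2 / 8)."""
    return math.exp(s * s * (b - a) ** 2 / 8.0)


if __name__ == "__main__":
    for a, b in ((-1.0, 1.0), (-0.5, 2.0), (-3.0, 0.5)):
        for s in (0.1, 1.0, 3.0):
            print(a, b, s, two_point_mgf(s, a, b), lemma_bound(s, a, b))
```

In every row the first value should not exceed the second; the two are closest for the symmetric case \(a=-b\) and small \(s\).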

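Markov's Inequality

The next step uses Markov's inequality, a standard result from any course in probability and statistics. The statement and a proof are reproduced here for completeness.
Let \(X\) be a nonnegative random variable whose expectation \(E(X)\) exists. Then for any \(t>0\):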

\[\begin{align} P(X\geq t)\leq\frac{E(X)}{t} \end{align} \]

Proof:
For simplicity, assume \(X\) has a density \(p(x)\) supported on \([a,b]\) with \(a\geq 0\). Then:

\[\begin{align} a=a\cdot\int_a^b p(x)dx\leq E(X)=\int_a^b x\cdot p(x)dx\leq b\cdot\int_a^b p(x)dx=b\\ \end{align} \]

that is, \(a\leq E(X)\leq b\).
If \(t\leq a\), then \(P(X\geq t)=1=\frac{t}{t}\leq\frac{a}{t}\leq\frac{E(X)}{t}\).
If \(t>b\), then \(P(X\geq t)=0=\frac{0}{t}\leq\frac{a}{t}\leq\frac{E(X)}{t}\).
If \(t\in(a,b]\), then:

\[\begin{align} E(X)&=\int_a^b x\cdot p(x)dx\\ &=\int_a^t x\cdot p(x)dx + \int_t^b x\cdot p(x)dx\\ &\geq \int_t^b x\cdot p(x)dx\\ &\geq t\cdot \int_t^b p(x)dx\\ &=t\cdot P(X\geq t) \end{align} \]

Dividing both sides by \(t>0\) gives \(P(X\geq t)\leq\frac{E(X)}{t}\), which proves Markov's inequality.
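Markov's inequality can also be verified exactly on a small example. The sketch below uses a discrete random variable (a fair die) with exact rational arithmetic rather than the density used in the proof; the function name `markov_check` and the example distribution are assumptions chosen for illustration.

```python
from fractions import Fraction


def markov_check(values, probs, t):
    """Return (P(X >= t), E[X] / t) for a nonnegative discrete X."""
    if any(v < 0 for v in values) or sum(probs) != 1 or t <= 0:
        raise ValueError("need nonnegative values, probabilities summing to 1, t > 0")
    tail = sum(p for v, p in zip(values, probs) if v >= t)  # P(X >= t)
    mean = sum(v * p for v, p in zip(values, probs))        # E[X]
    return tail, mean / t


if __name__ == "__main__":
    die_values = list(range(1, 7))
    die_probs = [Fraction(1, 6)] * 6
    for t in (2, 4, 5, 6):
        tail, bound = markov_check(die_values, die_probs, t)
        print(t, tail, bound, tail <= bound)
```

For a fair die and \(t=5\), the tail probability is \(\frac{1}{3}\) while the bound is \(\frac{7}{10}\), which shows how loose the inequality can be.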

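We now prove Hoeffding's inequality.
Let \(X_1,X_2,\cdots,X_n\) be \(n\) independent random variables with \(P(X_i\in[a_i,b_i])=1,1\leq i\leq n\), and let \(S_n=\sum_{i=1}^{n}X_i\). Fix any \(s>0\). Because \(x\mapsto e^{sx}\) is increasing, the event \(S_n-E[S_n]\geq t\) coincides with the event \(e^{s(S_n-E[S_n])}\geq e^{st}\). Applying Markov's inequality to this nonnegative random variable, and using independence in the last step, gives: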

\[\begin{align} P(S_n-E[S_n]\geq t)&=P(e^{s(S_n-E[S_n])}\geq e^{st})\\ &\leq e^{-st}E[e^{s(S_n-E[S_n])}]\\ &=e^{-st}E[e^{s(\sum_{i=1}^{n}X_i-E[\sum_{i=1}^{n}X_i])}]\\ &=e^{-st}E[e^{s(\sum_{i=1}^{n}(X_i-E(X_i)))}]\\ &=e^{-st}E[\prod_{i=1}^{n}e^{s(X_i-E(X_i))}]\\ &=e^{-st}\prod_{i=1}^nE[e^{s(X_i-E(X_i))}]\\ \end{align} \]

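Let \(Y_i=X_i-E(X_i)\). Then:

\[\begin{align} E(Y_i)=E(X_i-E(X_i))&=\int_{a_i}^{b_i} p(x_i)\cdot (x_i-E(X_i))dx_i\\ &=\int_{a_i}^{b_i}p(x_i)\cdot x_i dx_i - \int_{a_i}^{b_i}p(x_i)\cdot E(X_i)dx_i\\ &=E(X_i)-E(X_i)\int_{a_i}^{b_i}p(x_i)dx_i\\ &=E(X_i)-E(X_i)=0 \end{align} \]

Moreover, \(Y_i\in[a_i-E(X_i),b_i-E(X_i)]\) with probability 1, an interval of length \(b_i-a_i\). Each \(Y_i\) therefore satisfies the conditions of the lemma, and the bound above becomes: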

\[\begin{align} P(S_n-E[S_n]\geq t)&\leq e^{-st}\prod_{i=1}^nE[e^{s(Y_i)}]\\ &\leq e^{-st}\prod_{i=1}^{n}e^{\frac{1}{8}s^2(b_i-a_i)^2}\\ &=\exp(-st+\frac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2)\\ \end{align} \]

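The derivation above holds for every \(s>0\). Define:

\[\begin{align} \left\{ \begin{array}{l} g:R_+\rightarrow R \\ g(s)=-st+\frac{1}{8}s^2\sum_{i=1}^n(b_i-a_i)^2 \end{array} \right. \end{align} \]

\(g(s)\) is an upward-opening parabola in \(s\). Since the inequality above holds for every \(s>0\), it holds in particular for the value of \(s\) that minimizes \(g(s)\), which gives the tightest bound. Solving \(g^\prime(s)=0\) gives \(s=\frac{4t}{\sum_{i=1}^n(b_i-a_i)^2}>0\); substituting this value into the inequality gives: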

\[\begin{align} P(S_n-E[S_n]\geq t)\leq \exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}) \end{align} \]
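The substitution can be written out explicitly. As shorthand introduced only for this step, write \(D=\sum_{i=1}^n(b_i-a_i)^2\) and \(s^*=\frac{4t}{D}\) for the minimizer of \(g\). Then:

\[\begin{align} g(s^*)&=-\frac{4t}{D}\cdot t+\frac{1}{8}\cdot\frac{16t^2}{D^2}\cdot D\\ &=-\frac{4t^2}{D}+\frac{2t^2}{D}\\ &=-\frac{2t^2}{D} \end{align} \]

Since \(g^{\prime\prime}(s)=\frac{D}{4}>0\), this stationary point is indeed the minimum, and \(\exp(g(s^*))\) is exactly the right-hand side of the inequality above.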

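Applying the same argument to \(-X_1,-X_2,\cdots,-X_n\), that is, replacing \(S_n\) by \(-S_n\) (each \(-X_i\) lies in \([-b_i,-a_i]\), an interval of the same length \(b_i-a_i\)), gives: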

\[\begin{align} P(E[S_n]-S_n\geq t)\leq \exp(\frac{-2t^2}{\sum_{i=1}^n(b_i-a_i)^2}) \end{align} \]

This completes the proof of Hoeffding's inequality.

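Proof of the Generalization Error Bound

For any function \(f\in\mathcal{F}\), \(\hat{R}(f)\) is the sample mean of \(N\) independent, identically distributed random variables \(L(Y_i,f(X_i))\), and \(R(f)\) is the expectation of the random variable \(L(Y,f(X))\). If the loss function takes values in \([0,1]\), i.e. \([a_i,b_i]=[0,1]\) for all \(i\), then Hoeffding's inequality applies. For \(\epsilon>0\), first rewrite the event of interest: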

\[\begin{align} P(R(f)-\hat{R}(f)\geq\epsilon)&=P(E(L(Y,f(X)))-\frac{1}{N}\sum_{i=1}^{N}L(Y_i,f(X_i))\geq\epsilon)\\ &=P(N\cdot E(L(Y,f(X)))-\sum_{i=1}^{N}L(Y_i,f(X_i))\geq N\cdot\epsilon)\\ &=P(N\cdot E(L(Y,f(X)))-S_n\geq N\cdot\epsilon)\\ \end{align} \]

where \(S_n=\sum_{i=1}^{N}L(Y_i,f(X_i))\) (here \(n=N\)). Because the samples are identically distributed:

\[\begin{align} E(S_n)&=E[\sum_{i=1}^{N}L(Y_i,f(X_i))]\\ &=\sum_{i=1}^{N}E[L(Y_i,f(X_i))]\\ &=\sum_{i=1}^{N}E[L(Y,f(X))]\\ &=N\cdot E(L(Y,f(X))) \end{align} \]

Applying Hoeffding's inequality with \(t=N\cdot\epsilon\), the inequality becomes:

\[\begin{align} P(R(f)-\hat{R}(f)\geq\epsilon)&=P(N\cdot E(L(Y,f(X)))-S_n\geq N\cdot\epsilon)\\ &=P(E(S_n)-S_n\geq N\cdot\epsilon)\\ &\leq\exp(\frac{-2(N\cdot\epsilon)^2}{\sum_{i=1}^{N}(b_i-a_i)^2})\\ &=\exp(\frac{-2(N\cdot\epsilon)^2}{N\cdot 1})=\exp(-2N\epsilon^2)\\ \end{align} \]

The bound above holds for any single fixed \(f\in\mathcal{F}\). For the finite set \(\mathcal{F}=\{f_1,f_2,\cdots,f_d\}\), the event that some \(f\) satisfies \(R(f)-\hat{R}(f)\geq\epsilon\) is the union of the \(d\) events for the individual functions, so its probability is at most the sum of their probabilities (the union bound):

\[\begin{align} P(\exists f\in\mathcal{F}:R(f)-\hat{R}(f)\geq\epsilon)&=P(\bigcup_{f\in\mathcal{F}}\{ R(f)-\hat{R}(f)\geq\epsilon \})\\ &\leq \sum_{f\in\mathcal{F}}P(R(f)-\hat{R}(f)\geq\epsilon)\\ &\leq d\cdot \exp(-2N\epsilon^2) \end{align} \]
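The union bound can be illustrated with a small simulation. In the sketch below each hypothesis is modeled by a 0-1 loss that is Bernoulli with a known true risk, and losses are drawn independently across hypotheses. That independence is a simplifying assumption made here (real hypotheses are evaluated on one shared sample), but the union bound does not depend on it. The function names and parameter values are illustrative.

```python
import math
import random


def union_bound(d, n_samples, eps):
    """d * exp(-2 N eps^2), the right-hand side of the union bound."""
    return d * math.exp(-2.0 * n_samples * eps * eps)


def simulate_any_large_gap(true_risks, n_samples, eps, trials=2_000, seed=0):
    """Estimate P(exists f: R(f) - R_hat(f) >= eps) by simulation.

    true_risks[j] is the assumed true risk R(f_j); the empirical risk of f_j
    is the mean of n_samples independent Bernoulli(true_risks[j]) losses.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for risk in true_risks:
            errors = sum(1 for _ in range(n_samples) if rng.random() < risk)
            if risk - errors / n_samples >= eps:
                hits += 1
                break  # one violating hypothesis is enough for this trial
    return hits / trials


if __name__ == "__main__":
    risks = [0.5] * 10  # d = 10 hypotheses, all with true risk 0.5
    print(simulate_any_large_gap(risks, n_samples=100, eps=0.15))
    print(union_bound(d=10, n_samples=100, eps=0.15))
```

The simulated frequency should come out below the bound. The bound is only informative when \(d\exp(-2N\epsilon^2)<1\).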

Taking the complement, this is equivalent to the following statement about all \(f\in\mathcal{F}\) simultaneously:

\[\begin{align} P(\forall f\in\mathcal{F}:R(f)-\hat{R}(f)<\epsilon)\geq1-d\exp(-2N\epsilon^2) \end{align} \]

Let \(\delta=d\exp(-2N\epsilon^2)\). Then for every \(f\in\mathcal{F}\):

\[\begin{align} P(R(f)<\hat{R}(f)+\epsilon)\geq1-\delta \end{align} \]

That is, with probability at least \(1-\delta\) we have \(R(f)<\hat{R}(f)+\epsilon\). Solving \(\delta=d\exp(-2N\epsilon^2)\) for \(\epsilon\) gives:

\[\begin{align} \epsilon=\sqrt{\frac{1}{2N}(\log d-\log\delta)}=\epsilon(d,N,\delta) \end{align} \]

This establishes the generalization error bound:

\[\begin{align} R(f)\leq\hat{R}(f)+\epsilon(d,N,\delta) \end{align} \]
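The bound can also be read in the other direction: given a target gap \(\epsilon\) and confidence \(1-\delta\), solving \(\epsilon(d,N,\delta)\leq\epsilon\) for \(N\) gives \(N\geq\frac{\log d+\log\frac{1}{\delta}}{2\epsilon^2}\). The following Python sketch computes that sample size; the function name `samples_needed` and the example values are assumptions for illustration.

```python
import math


def samples_needed(d: int, eps: float, delta: float) -> int:
    """Smallest integer N (up to floating-point rounding) with
    sqrt((log d + log(1/delta)) / (2N)) <= eps."""
    if d < 1 or eps <= 0.0 or not 0.0 < delta < 1.0:
        raise ValueError("need d >= 1, eps > 0 and 0 < delta < 1")
    return math.ceil((math.log(d) + math.log(1.0 / delta)) / (2.0 * eps * eps))


if __name__ == "__main__":
    # Illustrative values: 100 hypotheses, gap 0.1, confidence 95%.
    print(samples_needed(d=100, eps=0.1, delta=0.05))
```

Halving \(\epsilon\) multiplies the required sample size by about four, while squaring \(d\) only doubles the \(\log d\) term.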

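Summary

This completes the proof of the generalization error bound for binary classification. The theorem says that, with probability at least \(1-\delta\), the generalization error is at most the training error plus \(\epsilon(d,N,\delta)\), a term that shrinks as the sample size \(N\) grows and grows with \(\log d\), the logarithm of the size of the hypothesis space. For this class of problems, a smaller training error therefore gives a smaller upper bound on the generalization error, which is a guarantee that a learned model can predict on unseen data.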

posted @ 2020-03-28 20:11  p_is_p