On the Derivation of GANs
Maximum Likelihood Estimation
- Sample \(\{x^1,x^2,\ldots,x^m\}\) from \(P_{data}(x)\)
- Compute the likelihood of each \(x^i\) under \(P_G(x;\theta)\):
\[L=\prod_{i=1}^mP_G(x^i;\theta) \]
- Find the \(\theta^*\) that maximizes this likelihood (a minimal numerical sketch follows this list)
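As a concrete illustration (not part of the original derivation), here is a minimal sketch assuming \(P_G(x;\theta)\) is a univariate Gaussian with \(\theta=(\mu,\sigma)\); the toy data and optimizer choice are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy stand-in for P_data: we only observe m samples from it.
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)  # {x^1, ..., x^m}

# Model P_G(x; theta) as a Gaussian with theta = (mu, log_sigma).
def neg_log_likelihood(theta):
    mu, log_sigma = theta
    # -sum_i log P_G(x^i; theta); minimizing this maximizes the likelihood L
    return -np.sum(norm.logpdf(samples, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # theta*: close to the true (2.0, 0.5)
```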
From MLE to KL Divergence
\[\begin{aligned}
\theta^*&=arg\,\max_{\theta}\prod_{i=1}^m{P_G(x^i;\theta)}\\
&=arg\,\max_{\theta}\log\prod_{i=1}^m{P_G(x^i;\theta)}\\
&=arg\,\max_{\theta}\sum_{i=1}^m\log{P_G(x^i;\theta)}\\
&\approx arg\,\max_{\theta}E_{x\sim P_{data}}[\log{P_G(x;\theta)}]\\
&=arg\,\max_{\theta}\int_{x}P_{data}(x)\log{P_G(x;\theta)}dx\\
&=arg\,\max_{\theta}\int_{x}P_{data}(x)\log{P_G(x;\theta)}dx-\int_{x}P_{data}(x)\log{P_{data}(x)}dx\\
\end{aligned}\]

The \(\approx\) step treats the sample average over \(\{x^i\}\) as an estimate of the expectation under \(P_{data}\); the term subtracted in the last line does not depend on \(\theta\), so it does not change the maximizer. Combining the two integrals:

\[\begin{aligned}
\theta^*&=arg\,\max_{\theta}\int_{x}P_{data}(x)\log{\frac{P_G(x;\theta)}{P_{data}(x)}}dx\\
&=arg\,\min_{\theta}KLD(P_{data}||P_G)
\end{aligned}\]
- However, in high-dimensional space \(P_{data}\) typically lies on a low-dimensional manifold; no matter what value \(\theta\) takes, the prior family used by maximum likelihood estimation (e.g. a Gaussian) cannot fit such a low-dimensional manifold
- The approach GAN takes is to use a generator to map vectors drawn from a prior distribution \(Z\) into the target distribution
- Mapping \(Z\) to an intermediate distribution \(W\) with a mapping network before mapping to the target distribution improves results (see the StyleGAN paper); a rough sketch of both ideas follows this list
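A minimal PyTorch-style sketch of these two ideas. The architecture, layer widths, and dimensions are illustrative assumptions, not the actual StyleGAN design.

```python
import torch
import torch.nn as nn

# Generator G: maps z ~ P_Z (a simple prior) toward the target distribution,
# so the generated samples can lie on a low-dimensional manifold in x-space.
class Generator(nn.Module):
    def __init__(self, in_dim=64, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, x_dim),
        )

    def forward(self, z):
        return self.net(z)

# StyleGAN-style idea (simplified): first map z to an intermediate latent w,
# then generate from w; the mapping network "unwarps" the fixed prior.
class MappingNetwork(nn.Module):
    def __init__(self, z_dim=64, w_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, w_dim), nn.ReLU(),
            nn.Linear(w_dim, w_dim),
        )

    def forward(self, z):
        return self.net(z)

z = torch.randn(16, 64)           # a batch of vectors from the prior Z
w = MappingNetwork()(z)           # intermediate distribution W
x_fake = Generator(in_dim=64)(w)  # samples intended to match P_data
```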
Why GAN Minimizes JS Divergence
Original GAN loss
\[\min_{G}\max_{D}V(D,G)
\]
\[\begin{aligned}
V&=E_{x\sim P_{data}}[\log D(x)]+E_{x\sim P_{G}}[\log (1-D(x))]\\
&=\int_{x}P_{data}(x)\log{D(x)}dx+\int_{x}P_{G}(x)\log{(1-D(x))}dx\\
&=\int_{x}\left[P_{data}(x)\log{D(x)}+P_{G}(x)\log{(1-D(x))}\right]dx\\
\end{aligned}\]
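In practice the two expectations in \(V\) are estimated from minibatches. A minimal sketch, assuming a discriminator that outputs probabilities in \((0,1)\); the batch size and placeholder outputs are illustrative.

```python
import torch

# d_real = D(x) on a batch of x ~ P_data, d_fake = D(G(z)) on a batch of z ~ P_Z.
def value_function(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # V = E_{x~P_data}[log D(x)] + E_{x~P_G}[log(1 - D(x))]
    return torch.log(d_real).mean() + torch.log(1.0 - d_fake).mean()

# The discriminator ascends V (equivalently minimizes -V, the usual binary
# cross-entropy loss); the generator descends V.
d_real = torch.rand(32).clamp(1e-6, 1 - 1e-6)  # placeholder D(x) outputs
d_fake = torch.rand(32).clamp(1e-6, 1 - 1e-6)  # placeholder D(G(z)) outputs
loss_D = -value_function(d_real, d_fake)
```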
Returning to the derivation: each \(x\) can be treated independently, so the pointwise optimal \(D^*\) can be found for each \(x\) separately. For a single \(x\), let
\[a=P_{data}(x),\quad b=P_{G}(x),\quad D=D(x)
\]
Find \(D^*\) by maximizing
\[f(D)=a\log(D)+b\log(1-D)
\]
\[\frac{\mathrm{d}f(D)}{\mathrm{d}D}=\frac{a}{D}-\frac{b}{1-D}
\]
Setting \(\frac{\mathrm{d}f(D)}{\mathrm{d}D}=0\):
\[D^*=\frac{a}{a+b}=\frac{P_{data}(x)}{P_{data}(x)+P_{G}(x)}
\]
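To confirm that this stationary point is a maximum (a step the note skips), the second derivative is negative everywhere on \((0,1)\):

\[\frac{\mathrm{d}^2f(D)}{\mathrm{d}D^2}=-\frac{a}{D^2}-\frac{b}{(1-D)^2}<0
\]

So \(f\) is concave in \(D\) and \(D^*\) is indeed the maximizer.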
Substituting \(D^*\) into \(V\) gives
\[\begin{aligned}
V&=\int_xP_{data}(x)\log{\frac{P_{data}(x)}{P_{data}(x)+P_{G}(x)}}dx+\int_xP_{G}(x)\log{\frac{P_{G}(x)}{P_{data}(x)+P_{G}(x)}}dx\\
&=-2\log2+\int_xP_{data}(x)\log{\cfrac{P_{data}(x)}{\cfrac{P_{data}(x)+P_{G}(x)}{2}}}dx+\int_xP_{G}(x)\log{\cfrac{P_{G}(x)}{\cfrac{P_{data}(x)+P_{G}(x)}{2}}}dx\\
&=-2\log2+KLD(P_{data}||\frac{P_{data}+P_{G}}{2})+KLD(P_{G}||\frac{P_{data}+P_{G}}{2})\\
&=-2\log2+2JSD(P_{data}||P_{G})
\end{aligned}\]
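As a quick sanity check of this result (not in the original note): when the generator exactly matches the data distribution, the optimal discriminator is maximally confused and the value function reaches its minimum.

\[P_G=P_{data}\ \Rightarrow\ D^*(x)=\frac{1}{2}\ \Rightarrow\ \max_DV=-2\log2+2JSD(P_{data}||P_{data})=-2\log2
\]

Since \(JSD\ge0\), \(-2\log2\) is the smallest value the generator can reach, so the minimax game is solved exactly when \(P_G=P_{data}\).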
f-divergence
Definition of the f-divergence:
\[D_f(P||Q)=\int_xq(x)f(\frac{p(x)}{q(x)})dx
\]
When \(f\) is convex and \(f(1)=0\), \(D_f\) is non-negative\(^*\), and \(D_f=0\) when \(P=Q\).
\[*:\ D_f(P||Q)\ge f\left(\int_xq(x)\frac{p(x)}{q(x)}dx\right)=f(1)=0\qquad\text{(by Jensen's inequality)}
\]
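Two standard instances (added here for illustration) show how the choice of \(f\) determines the divergence:

\[f(x)=x\log x:\quad D_f(P||Q)=\int_xq(x)\frac{p(x)}{q(x)}\log\frac{p(x)}{q(x)}dx=KLD(P||Q)
\]
\[f(x)=-\log x:\quad D_f(P||Q)=\int_xq(x)\log\frac{q(x)}{p(x)}dx=KLD(Q||P)
\]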
Fenchel Conjugate
Every convex function \(f\) has a corresponding convex conjugate \(f^*\):
\[f^*(t)=\max_{x\in dom(f)}\{xt-f(x)\}
\]
That is, for each fixed \(t\), \(f^*(t)\) is the least upper bound (supremum) of \(xt-f(x)\) over \(x\).
To evaluate \(f^*(t)\): treat \(t\) as a constant, differentiate \(xt-f(x)\) with respect to \(x\), solve for \(x\) in terms of \(t\) at the point where the derivative is zero, and substitute back to eliminate \(x\).
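A worked example of this recipe (an addition for illustration), for \(f(x)=x\log x\):

\[\frac{\partial}{\partial x}\left(xt-x\log x\right)=t-\log x-1=0\ \Rightarrow\ x=e^{t-1}\ \Rightarrow\ f^*(t)=e^{t-1}t-e^{t-1}(t-1)=e^{t-1}
\]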
By Fenchel duality, the relation also holds in the other direction:
\[f(x)=\max_{t\in dom(f^*)}\{xt-f^*(t)\}
\]
Therefore
\[D_f(P||Q)=\int_xq(x)\left(\max_{t\in dom(f^*)}\left\{\frac{p(x)}{q(x)}t-f^*(t)\right\}\right)dx
\]
Now replace the pointwise maximization with a single function \(D\) that maps each \(x\) to a value \(t=D(x)\); since any particular \(D\) can do no better than the pointwise maximum, this yields a lower bound on \(D_f\):
\[\begin{aligned}
D_f(P||Q)&\ge\int_xq(x)\left(\frac{p(x)}{q(x)}D(x)-f^*(D(x))\right)dx\\
&=\int_xp(x)D(x)dx-\int_xq(x)f^*(D(x))dx
\end{aligned}
\]
Choosing the \(D(x)\) that maximizes this lower bound, the maximized bound is used as an approximation of \(D_f\):
\[\begin{aligned}
D_f&\approx \max_D\int_xp(x)D(x)dx-\int_xq(x)f^*(D(x))dx\\
&=\max_D\{E_{x\sim P}[D(x)]-E_{x\sim Q}[f^*(D(x))]\}\\
&=\max_D\{E_{x\sim P_{data}}[D(x)]-E_{x\sim P_{G}}[f^*(D(x))]\}
\end{aligned}\]
At this point, substituting different choices of \(f\) (and hence \(f^*\)) lets a GAN minimize different divergences; one instantiation is worked out below.
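As one worked instantiation (following the f-GAN construction, not spelled out above): taking \(f(x)=x\log x\), so that \(D_f=KLD(P_{data}||P_G)\) and, as computed earlier, \(f^*(t)=e^{t-1}\), the objective becomes

\[KLD(P_{data}||P_G)\approx\max_D\{E_{x\sim P_{data}}[D(x)]-E_{x\sim P_{G}}[e^{D(x)-1}]\}
\]

The generator is then trained to minimize this quantity, in place of the JS divergence minimized by the original GAN.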