MADE: Masked Autoencoder for Distribution Estimation
Overview
Consider
\[\hat{x} = f(x) \in \mathbb{R}^D, \quad x \in \mathbb{R}^D.
\]
How should \(f\) be structured so that
\[\hat{x} = [f_1, f_2(x_1), f_3(x_1, x_2), \ldots, f_D(x_1, x_2, \ldots, x_{D-1})].
\]
That is, \(\hat{x}_d\) depends only on \(x_{<d}\) (in particular, \(\hat{x}_1\) is a constant). The outputs can then parametrize the autoregressive factorization \(p(x) = \prod_{d=1}^{D} p(x_d \mid x_{<d})\).
Main idea
Suppose layer \(l\) is given by:
\[x^l = \sigma^l(W^lx^{l-1} + b^l).
\]
The authors' idea is to assign to the \(k\)-th neuron of the first hidden layer a number \(m^1(k) \in \{1, \ldots, D-1\}\), and to build a mask matrix \(M^1\):
\[M^1_{k,d} =
\left \{
\begin{array}{ll}
1, & m^1(k) \ge d \\
0, & \mathrm{else}.
\end{array}
\right .
\]
The actual computation is then:
\[x^1 = \sigma^1\big((W^1 \odot M^1)\, x + b^1\big).
\]
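The first-layer mask can be sketched in NumPy as follows (toy sizes; the uniform sampling of the degrees and all variable names are my own choices, not prescribed by the note):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 8  # input dimension and hidden width (toy sizes)

# Assign each hidden unit k a degree m1[k] in {1, ..., D-1}.
m1 = rng.integers(1, D, size=H)  # upper bound is exclusive

# Mask: M1[k, d] = 1 iff m1[k] >= d, for input coordinates d = 1..D.
d = np.arange(1, D + 1)
M1 = (m1[:, None] >= d[None, :]).astype(float)

# Masked first layer: x1 = sigma((W1 * M1) @ x + b1).
W1 = rng.normal(size=(H, D))
b1 = np.zeros(H)
x = rng.normal(size=D)
x1 = np.tanh((W1 * M1) @ x + b1)
```

By construction, row \(k\) of the mask has exactly \(m^1(k)\) ones: unit \(k\) sees only the first \(m^1(k)\) input coordinates.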
Further, assign to the \(i\)-th neuron of the \(l\)-th hidden layer a number \(m^l(i) \in \{\min_j m^{l-1}(j), \ldots, D-1\}\) (otherwise some rows of \(M^l\) would be all zeros):
\[M^l_{i,j} =
\left \{
\begin{array}{ll}
1, & m^l(i) \ge m^{l-1}(j) \\
0, & \mathrm{else}.
\end{array}
\right .
\]
\[x^l = \sigma^l\big((W^l \odot M^l)\, x^{l-1} + b^l\big).
\]
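The deeper-layer mask, and why the lower bound \(\min_j m^{l-1}(j)\) prevents all-zero rows, can be sketched like this (my own toy sketch, with \(m^{l-1}\) sampled as a stand-in for an actual previous layer):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 5, 8  # toy sizes

# Degrees of the previous layer (sampled here as a stand-in for m^{l-1}).
m_prev = rng.integers(1, D, size=H)

# Sample m_l[i] in {min(m_prev), ..., D-1}: the lower bound guarantees
# that every row of M_l keeps at least one connection.
m_l = rng.integers(m_prev.min(), D, size=H)

# M_l[i, j] = 1 iff m_l[i] >= m_prev[j].
M_l = (m_l[:, None] >= m_prev[None, :]).astype(float)
```

Since \(m^l(i) \ge \min_j m^{l-1}(j)\), every row of `M_l` contains at least one 1.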
And finally, for the output layer:
\[M^L_{d,k} =
\left \{
\begin{array}{ll}
1, & d > m^{L-1}(k) \\
0, & \mathrm{else}.
\end{array}
\right .
\]
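Stacking the three masks, the overall input-to-output connectivity is the product \(M^L M^{L-1} \cdots M^1\). A nonzero path from input \(j\) to output \(d\) requires \(d > m^{L-1}(k) \ge \cdots \ge m^1(k') \ge j\), i.e. \(d > j\), so the product is strictly lower triangular. A toy sketch with two hidden layers checks this numerically (all names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 4, 6  # toy sizes

d = np.arange(1, D + 1)

# First hidden layer: M1[k, j] = 1 iff m1[k] >= j.
m1 = rng.integers(1, D, size=H)
M1 = (m1[:, None] >= d[None, :]).astype(float)

# Second hidden layer: M2[k, j] = 1 iff m2[k] >= m1[j].
m2 = rng.integers(m1.min(), D, size=H)
M2 = (m2[:, None] >= m1[None, :]).astype(float)

# Output layer: ML[d, k] = 1 iff d > m2[k] (strict inequality).
ML = (d[:, None] > m2[None, :]).astype(float)

# conn[d, j] > 0 iff output d is connected to input j.
conn = ML @ M2 @ M1

# Strictly lower triangular <=> \hat{x}_d depends only on x_{<d}.
assert np.allclose(np.triu(conn), 0)
```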
My personal impression: this induces quite noticeable sparsification... and the deeper the network, the worse it gets.
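This impression can be probed empirically by measuring the density of each mask (the fraction of unmasked weights) layer by layer. A toy sketch, treating the input coordinates as the degrees of "layer 0" (sizes and naming are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, L = 10, 64, 4  # toy sizes: input dim, hidden width, number of layers

densities = []
m_prev = np.arange(1, D + 1)  # input coordinates act as the initial degrees
for _ in range(L):
    # Degrees in {min(m_prev), ..., D-1}, as in the construction above.
    m = rng.integers(m_prev.min(), D, size=H)
    M = (m[:, None] >= m_prev[None, :]).astype(float)
    densities.append(float(M.mean()))
    m_prev = m

print(densities)  # fraction of surviving connections in each layer
```

Running this for a few seeds gives a quick sense of how much connectivity each mask removes; whether the trend worsens with depth depends on how the degrees drift.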