
Hard-Concrete Distribution

The Hard Concrete Distribution (Hard-sigmoid+Concrete)

To overcome the problems (over-parameterization and overfitting) caused by the dense weights of a network, we can use regularization to prune it. Traditionally we use Lasso (\(L_1\)) or Ridge (\(L_2\)) regularization, but what we really want is \(L_0\) regularization. The reason we do not use \(L_0\) directly is that the \(L_0\) norm of the weights is not differentiable, so we cannot incorporate it as a regularization term in the objective function; moreover, when the dimension of the weights is large, searching over all sparsity patterns becomes an NP-hard problem. So the authors of the \(L_0\) regularization paper (Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through \(L_0\) regularization. ICLR, 2018.) came up with the hard concrete distribution for the gates, which is obtained by “stretching” a binary concrete distribution and then transforming its samples with a hard-sigmoid. The parameters of the distribution over the gates can then be jointly optimized with the original network parameters. As a result, the method allows for straightforward and efficient learning of model structures with stochastic gradient descent, and allows for conditional computation in a principled way.

Introduction

A way to address both over-parameterization and overfitting is to employ model compression and sparsification techniques. By sparsifying the model, we avoid unnecessary computation and resources, since irrelevant degrees of freedom are pruned away and do not need to be computed. Furthermore, we reduce the model's complexity, thus penalizing memorization and alleviating overfitting.

A conceptually attractive approach is \(L_0\)-norm regularization of (blocks of) parameters, as below.

\[\begin{gathered} \mathcal{R}(\boldsymbol{\theta})=\frac{1}{N}\left(\sum_{i=1}^{N} \mathcal{L}\left(h\left(\mathbf{x}_{i} ; \boldsymbol{\theta}\right), \mathbf{y}_{i}\right)\right)+\lambda\|\boldsymbol{\theta}\|_{0}, \quad\|\boldsymbol{\theta}\|_{0}=\sum_{j=1}^{|\theta|} \mathbb{I}\left[\theta_{j} \neq 0\right], \\ \boldsymbol{\theta}^{*}=\underset{\boldsymbol{\theta}}{\arg \min }\{\mathcal{R}(\boldsymbol{\theta})\} \end{gathered} \]

The first term on the r.h.s. is the loss function, measuring the prediction error between the model output and the true value. The second term on the r.h.s. is the penalty on the weights.
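
As a tiny illustration (a hypothetical weight vector, not from the paper), the \(L_0\) norm simply counts non-zero entries and, unlike the \(L_1\) norm, is unaffected by their magnitudes:

```python
import torch

theta = torch.tensor([0.0, -1.3, 0.0, 0.7, 2.1])

l0 = (theta != 0).sum()      # L0 norm: number of non-zero entries -> 3
l1 = theta.abs().sum()       # L1 (Lasso) penalty depends on magnitudes -> ~4.1
print(l0.item(), l1.item())
```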

The \(L_0\) norm penalizes the number of non-zero entries of the weight vector and thus encourages sparsity in the final estimate \(\boldsymbol{\theta}^{*}\). Notice that the \(L_0\) norm induces no shrinkage on the actual values of the parameters \(\boldsymbol{\theta}\); this is in contrast to e.g. \(L_1\) regularization and the Lasso (Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.), where the sparsity is due to shrinking the actual values of \(\boldsymbol{\theta}\).

Unfortunately, optimization under this penalty is computationally intractable due to its non-differentiability and the combinatorial nature of the \(2^{|\theta|}\) possible states of the parameter vector \(\boldsymbol{\theta}\). How can we relax the discrete nature of the \(L_0\) penalty so as to allow for efficient continuous optimization of the above objective, while still allowing for exact zeros in the parameters?

Consider the \(L_0\) norm under a simple re-parametrization of \(\boldsymbol{\theta}\):

\[\theta_{j}=\tilde{\theta}_{j} z_{j}, \quad z_{j} \in\{0,1\}, \quad \tilde{\theta}_{j} \neq 0, \quad\|\boldsymbol{\theta}\|_{0}=\sum_{j=1}^{|\theta|} z_{j} \]

where the \(z_j\) correspond to binary “gates” that denote whether a parameter is present, and the \(L_0\) norm counts the number of gates that are “on”. By letting \(q(z_j \mid \pi_j) = \mathrm{Bern}(\pi_j)\) be a Bernoulli distribution over each gate \(z_j\), we can reformulate the minimization as penalizing the number of parameters being used, on average, as follows:

\[\begin{gathered} \mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi})=\mathbb{E}_{q(\mathbf{z} \mid \boldsymbol{\pi})}\left[\frac{1}{N}\left(\sum_{i=1}^{N} \mathcal{L}\left(h\left(\mathbf{x}_{i} ; \tilde{\boldsymbol{\theta}} \odot \mathbf{z}\right), \mathbf{y}_{i}\right)\right)\right]+\lambda \sum_{j=1}^{|\theta|} \pi_{j}, \\ \tilde{\boldsymbol{\theta}}^{*}, \boldsymbol{\pi}^{*}=\underset{\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi}}{\arg \min }\{\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi})\} \end{gathered} \]
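
A minimal sketch of this Bernoulli-gate formulation (with hypothetical probabilities \(\pi_j\)); note that the sampling step is not differentiable with respect to \(\boldsymbol{\pi}\), which is exactly the problem discussed next:

```python
import torch

pi = torch.tensor([0.9, 0.1, 0.5])            # q(z_j = 1) for each gate
theta_tilde = torch.tensor([1.2, -0.4, 0.8])  # free (non-zero) parameters

z = torch.bernoulli(pi)        # one sample of the binary gates (non-differentiable)
theta = theta_tilde * z        # gated parameters fed to the network h(x; theta)

expected_l0 = pi.sum()         # the penalty term: E_q[||theta||_0] = sum_j pi_j
```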

Now the second term on the r.h.s. of the above equation is straightforward to minimize; however, the first term is problematic for \(\boldsymbol{\pi}\) due to the discrete nature of \(\mathbf{z}\), which does not allow for efficient gradient-based optimization. For example, if \(|\theta|\) is 100, then \(\mathbf{z}\) has \(2^{100}\) possible discrete states. Optimization would therefore be more convenient if we could relax the discrete distribution in \(\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\pi})\) to a continuous one, or use a specific method to approximate the gradients of the discrete distribution. While in principle a gradient estimator such as REINFORCE (Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992.) could be employed, it suffers from high variance, and control variates (Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pp. 2188–2196, 2016.), which require auxiliary models or multiple evaluations of the network, have to be employed. Two simpler alternatives would be to use either the straight-through estimator (Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.), as done in Srinivas et al. (2017), or the concrete distribution, as e.g. in (Yarin Gal, Jiri Hron, and Alex Kendall. Concrete dropout. arXiv preprint arXiv:1705.07832, 2017.). Unfortunately both of these approaches have drawbacks: the first provides biased gradients, since it ignores the Heaviside function in the likelihood during the gradient evaluation, whereas the second does not allow the gates (and hence parameters) to be exactly zero during optimization, thus precluding the benefits of conditional computation.

Fortunately, there is a simple alternative way to smooth the objective such that we allow for efficient gradient-based optimization of the expected \(L_0\) norm along with exact zeros in the parameters \(\boldsymbol{\theta}\). Let \(s\) be a continuous random variable with a distribution \(q(s)\) that has parameters \(\phi\). We can now let the gates \(z\) be given by a hard-sigmoid rectification of \(s\), as follows:

\[\begin{aligned} &\mathbf{s} \sim q(\mathbf{s} \mid \boldsymbol{\phi}) \\ &\mathbf{z}=\min (\mathbf{1}, \max (\mathbf{0}, \mathbf{s})) \end{aligned} \]
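
In code this rectification is just a clamp; a one-line sketch:

```python
import torch

def hard_sigmoid(s: torch.Tensor) -> torch.Tensor:
    # z = min(1, max(0, s)): exact zeros for s <= 0, exact ones for s >= 1
    return torch.clamp(s, min=0.0, max=1.0)
```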

This would then allow the gate to be exactly zero and, due to the underlying continuous random variable \(s\), we can still compute the probability of the gate being non-zero (active). This is easily obtained from the cumulative distribution function (CDF) \(Q(\cdot)\) of \(s\):

\[q(\mathbf{z} \neq 0 \mid \boldsymbol{\phi})=1-Q(\mathbf{s} \leq 0 \mid \boldsymbol{\phi}) \]

i.e. it is the probability of the variable \(s\) being positive. We can thus smooth the binary Bernoulli gates \(\mathbf{z}\) appearing in the loss function by employing continuous distributions in the aforementioned way:

\[\begin{gathered} \mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi})=\mathbb{E}_{q(\mathbf{s} \mid \boldsymbol{\phi})}\left[\frac{1}{N}\left(\sum_{i=1}^{N} \mathcal{L}\left(h\left(\mathbf{x}_{i} ; \tilde{\boldsymbol{\theta}} \odot g(\mathbf{s})\right), \mathbf{y}_{i}\right)\right)\right]+\lambda \sum_{j=1}^{|\theta|}\left(1-Q\left(s_{j} \leq 0 \mid \phi_{j}\right)\right) \\ \tilde{\boldsymbol{\theta}}^{*}, \boldsymbol{\phi}^{*}=\underset{\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi}}{\arg \min }\{\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi})\}, \quad g(\cdot)=\min (1, \max (0, \cdot)) \end{gathered} \]

Notice that this is a close surrogate to the original objective, as we similarly have a cost that explicitly penalizes the probability of a gate being different from zero.

Now, for continuous distributions \(q(s)\) that allow for the reparameterization trick, we can express the objective in the above equation as an expectation over a parameter-free noise distribution \(p(\epsilon)\) and a deterministic, differentiable transformation \(f(\cdot)\) of the parameters \(\phi\) and \(\epsilon\):

\[\mathcal{R}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi})=\mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\frac{1}{N}\left(\sum_{i=1}^{N} \mathcal{L}\left(h\left(\mathbf{x}_{i} ; \tilde{\boldsymbol{\theta}} \odot g(f(\boldsymbol{\phi}, \boldsymbol{\epsilon}))\right), \mathbf{y}_{i}\right)\right)\right]+\lambda \sum_{j=1}^{|\theta|}\left(1-Q\left(s_{j} \leq 0 \mid \phi_{j}\right)\right) \]

which allows us to make the following Monte Carlo approximation to the (generally) intractable expectation over the noise distribution \(p(\epsilon)\):

\[\begin{aligned} \hat{\mathcal{R}}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi}) &=\frac{1}{L} \sum_{l=1}^{L}\left(\frac{1}{N}\left(\sum_{i=1}^{N} \mathcal{L}\left(h\left(\mathbf{x}_{i} ; \tilde{\boldsymbol{\theta}} \odot \mathbf{z}^{(l)}\right), \mathbf{y}_{i}\right)\right)\right)+\lambda \sum_{j=1}^{|\theta|}\left(1-Q\left(s_{j} \leq 0 \mid \phi_{j}\right)\right) \\ &=\mathcal{L}_{E}(\tilde{\boldsymbol{\theta}}, \boldsymbol{\phi})+\lambda \mathcal{L}_{C}(\boldsymbol{\phi}), \quad \text { where } \mathbf{z}^{(l)}=g\left(f\left(\phi, \boldsymbol{\epsilon}^{(l)}\right)\right) \text { and } \boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon}) \end{aligned} \]
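
To see why the reparameterization matters, here is a toy single-sample (\(L=1\)) estimate using a Gaussian \(q(s\mid\phi)\) as a stand-in (not the distribution used in this method; that comes next). Because the sample is a deterministic, differentiable function of \(\phi\), the gradient of the loss flows back to \(\phi\):

```python
import torch

mu = torch.tensor([0.5, -0.3], requires_grad=True)  # phi: location of q(s | phi)
eps = torch.randn(2)                                 # eps ~ p(eps), parameter-free
s = mu + 0.1 * eps                                   # s = f(phi, eps), differentiable
z = torch.clamp(s, 0.0, 1.0)                         # z = g(s), allows exact zeros

theta_tilde = torch.tensor([1.2, -0.8])
loss = ((theta_tilde * z).sum() - 1.0) ** 2          # stand-in for the data loss L_E
loss.backward()                                      # d(loss)/d(mu) is well defined
print(mu.grad)
```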

We then need to choose a suitable continuous distribution \(q(s)\) for \(s\), one that admits the reparameterization trick for sampling. Many distributions could work, but in practice the binary concrete distribution performs well. The concrete distribution is supported on the interval \((0,1)\), with pdf \(q_s(s\mid\phi)\) and cdf \(Q_s(s\mid\phi)\). Its parameters are \(\phi=(\log\alpha,\beta)\), where \(\log\alpha\) is the location and \(\beta\) is the temperature. We then stretch the distribution from \((0,1)\) to the interval \((\gamma,\zeta)\), with \(\gamma\leq 0\) and \(\zeta\geq 1\), and finally apply a hard-sigmoid to its samples:

\[\begin{gathered} u \sim \mathcal{U}(0,1), \quad s=\operatorname{Sigmoid}((\log u-\log (1-u)+\log \alpha) / \beta), \quad \bar{s}=s(\zeta-\gamma)+\gamma \\ z=\min (1, \max (0, \bar{s})) \end{gathered} \]
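
A minimal PyTorch sketch of this sampler (the helper name is mine; \(\gamma=-0.1\), \(\zeta=1.1\), \(\beta=2/3\) are the values suggested in the original paper):

```python
import torch

def sample_hard_concrete(log_alpha: torch.Tensor,
                         gamma: float = -0.1, zeta: float = 1.1,
                         beta: float = 2.0 / 3.0) -> torch.Tensor:
    """One reparameterized sample z = g(f(phi, eps)) of the hard concrete gate."""
    u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)  # u ~ U(0,1), avoid log(0)
    s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma                       # stretch to (gamma, zeta)
    return torch.clamp(s_bar, 0.0, 1.0)                     # fold mass onto exact {0, 1}
```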

This stretch-and-rectify construction induces a distribution where the probability mass of \(q_{\bar{s}}(\bar{s}\mid\phi)\) on the negative values, \(Q_{\bar{s}}(0\mid\phi)\), is “folded” into a delta peak at zero, the probability mass on values larger than one, \(1-Q_{\bar{s}}(1\mid\phi)\), is “folded” into a delta peak at one, and the original distribution \(q_{\bar{s}}(\bar{s}\mid\phi)\) is truncated to the \((0,1)\) range.

Finally, the \(L_0\) complexity loss of the objective under the hard concrete r.v. is conveniently expressed as follows:

\[\mathcal{L}_{C}=\sum_{j=1}^{|\theta|}\left(1-Q_{\bar{s}_{j}}(0 \mid \phi)\right)=\sum_{j=1}^{|\theta|} \operatorname{Sigmoid}\left(\log \alpha_{j}-\beta \log \frac{-\gamma}{\zeta}\right) \]
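
In code this penalty is just a sigmoid of the shifted location parameters, summed over the gates (same hypothetical constants as in the sampler above):

```python
import math
import torch

def l0_complexity(log_alpha: torch.Tensor,
                  gamma: float = -0.1, zeta: float = 1.1,
                  beta: float = 2.0 / 3.0) -> torch.Tensor:
    # L_C = sum_j Sigmoid(log_alpha_j - beta * log(-gamma / zeta))
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()
```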

At test time we use the following estimator for the final parameters \(\boldsymbol{\theta}^{*}\) under a hard concrete gate:

\[\hat{\mathbf{z}}=\min (\mathbf{1}, \max (\mathbf{0}, \operatorname{Sigmoid}(\log \boldsymbol{\alpha})(\zeta-\gamma)+\gamma)), \quad \boldsymbol{\theta}^{*}=\tilde{\boldsymbol{\theta}}^{*} \odot \hat{\mathbf{z}} \]
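
A sketch of this deterministic test-time gate, with the same hypothetical constants:

```python
import torch

def test_time_gates(log_alpha: torch.Tensor,
                    gamma: float = -0.1, zeta: float = 1.1) -> torch.Tensor:
    # z_hat = min(1, max(0, Sigmoid(log_alpha) * (zeta - gamma) + gamma))
    s_bar = torch.sigmoid(log_alpha) * (zeta - gamma) + gamma
    return torch.clamp(s_bar, 0.0, 1.0)

# theta_star = theta_tilde * test_time_gates(log_alpha); gates whose log_alpha is
# sufficiently negative come out exactly zero, so those parameters can be pruned.
```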

posted @ 2022-05-14 16:15 晓哥爱咖啡