BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1

Abstract

We introduce BinaryNet, a method which trains DNNs with binary weights and activations when computing parameters’ gradient.

At run-time, BinaryNet drastically reduces memory usage and replaces most multiplications by 1-bit exclusive-not-or (XNOR) operations, which might have a big impact on both general-purpose and dedicated Deep Learning hardware.

We wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST MLP 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for BinaryNet is available.

Introduction

Today, DNNs are almost exclusively trained on one or many very fast and power-hungry Graphics Processing Units (GPUs) (Coates et al., 2013). As a result, it is often a challenge to run DNNs on target low-power devices, and much research work is done to speed up DNNs at run-time on both general-purpose (Vanhoucke et al., 2011; Gong et al., 2014; Romero et al., 2014; Han et al., 2015) and specialized computer hardware (Farabet et al., 2011a;b; Pham et al., 2012; Chen et al., 2014a;b; Esser et al., 2015).

The contributions of our article are the following:

  • We introduce BinaryNet, a method which trains DNNs with binary weights and activations when computing the parameters’ gradient (see Section 1).
  • We show that it is possible to train a Multi Layer Perceptron (MLP) on MNIST and ConvNets on CIFAR-10 and SVHN with BinaryNet and achieve nearly state-of-the-art results (see Section 2).
  • We show that, at run-time, BinaryNet drastically reduces memory usage and replaces most multiplications by 1-bit exclusive-not-or (XNOR) operations.
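
To see why XNOR can stand in for multiplication, here is a minimal sketch in plain Python. It is not the paper's GPU kernel. The bit-packing convention (+1 stored as bit 1, −1 as bit 0) and the names `pack` and `xnor_dot` are assumptions made for illustration only.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int, one bit each (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: bit is 1 where the signs match
    matches = bin(agree).count("1")     # popcount of the matching positions
    return 2 * matches - n              # (#matches) - (#mismatches)


a = [1, -1, -1, 1, 1]
b = [1, 1, -1, -1, 1]
# The bitwise result agrees with the ordinary multiply-and-sum dot product.
assert xnor_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
```

Each product of two ±1 values is +1 when the signs agree and −1 when they differ, so counting agreements with XNOR and a popcount recovers the whole dot product without a single multiplication.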

1. BinaryNet

In this section, we detail our binarization function, how we use it to compute the parameters’ gradient and how we backpropagate through it.

Sign Function

BinaryNet constrains both the weights and the activations to either +1 or −1.
Our binarization function is simply the sign function:

\[x^b = \mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x \geq 0, \\ -1 & \text{otherwise.} \end{cases}\]
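
As a small illustration of the definition above, the binarization can be written in a few lines of plain Python. The helper names `sign` and `binarize` are made up for this sketch.

```python
def sign(x):
    """Deterministic binarization: +1 for x >= 0, -1 otherwise."""
    return 1 if x >= 0 else -1


def binarize(values):
    """Apply sign() elementwise to a list of real values."""
    return [sign(v) for v in values]


print(binarize([0.7, -0.2, 0.0, -3.5]))  # [1, -1, 1, -1]
```

Note that, unlike the usual mathematical sign function, zero maps to +1 here, so the output is always strictly binary.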

Gradients computation and accumulation

A key point to understand about BinaryNet is that although we compute the parameters’ gradient using binary weights and activations, we nonetheless accumulate the weights’ real-valued gradient in real-valued variables, as per Algorithm 1.
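In other words, the gradients are computed using the binarized weights and activations, but the weights’ gradients themselves are still kept as real values (SGD still runs in floating-point precision).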

SGD explores the space of parameters by making small and noisy steps and that noise is averaged out by the stochastic gradient contributions accumulated in each weight. Therefore, it is important to keep sufficient resolution for these accumulators, which at first sight suggests that high precision is absolutely required.
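
The need for real-valued accumulators can be seen in a toy example. The sketch below is an illustration only, not Algorithm 1: the constant gradient, the learning rate, and the starting value are all hypothetical, chosen so that the effect is easy to follow.

```python
def sign(x):
    return 1 if x >= 0 else -1


lr = 0.01
grad = -1.0          # hypothetical constant gradient pushing the weight upward

# (a) Keep a real-valued accumulator and binarize only when the weight is used.
w_real = -0.045
history = []
for step in range(10):
    history.append(sign(w_real))   # binary weight used in this step
    w_real -= lr * grad            # the small step is stored at full precision

# (b) Store only the binary weight and re-binarize after every update.
w_direct = -1
for step in range(10):
    w_direct = sign(w_direct - lr * grad)   # the 0.01 step is lost each time

print(history)    # [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
print(w_direct)   # -1
```

In case (a) the small steps add up until the binary weight flips. In case (b) every step is rounded away and the weight never moves.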

Besides that, adding noise to weights and activations when computing the parameters’ gradient provides a form of regularization which can help to generalize better, as previously shown with variational weight noise (Graves, 2011), Dropout (Srivastava, 2013; Srivastava et al., 2014) and DropConnect (Wan et al., 2013).

Propagating Gradients Through Discretization

The derivative of the sign function is 0 almost everywhere, making it apparently incompatible with backpropagation, since exact gradients of the cost with respect to the quantities before the discretization (pre-activations or weights) would be zero.
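In short, the sign function causes a problem for backpropagation.
Bengio (2013) found in experiments that the fastest training was obtained when using the “straight-through estimator”, previously introduced in Hinton (2012)’s lectures.
That is, earlier work found that the straight-through estimator (which passes the error straight through to the next layer) trains well.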
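
A minimal sketch of the straight-through idea in plain Python follows. The function names are illustrative. The optional `saturate` flag, which zeroes the gradient when the input magnitude exceeds 1, is a commonly used variant and is an assumption here, since these notes do not describe it.

```python
def sign(x):
    return 1 if x >= 0 else -1


def forward(pre_activation):
    """Forward pass: the discretization is applied as usual."""
    return sign(pre_activation)


def backward_straight_through(pre_activation, grad_output, saturate=True):
    """Backward pass: treat sign() as if it were the identity.

    With saturate=True (an assumption, see the text above), the gradient
    is cancelled when |pre_activation| > 1.
    """
    if saturate and abs(pre_activation) > 1:
        return 0.0
    return grad_output


print(forward(0.3), backward_straight_through(0.3, 0.8))   # 1 0.8
print(forward(2.5), backward_straight_through(2.5, 0.8))   # 1 0.0
```

The true derivative of the sign function would return 0 in both cases, so no learning signal would reach the earlier layers. The estimator trades exactness for a usable gradient.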

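The algorithms are long and detailed, and I have not read them closely. Roughly, they cover:

  • The method for training a DNN with BinaryNet: computing the parameters’ gradient (forward and backward passes) and accumulating the parameters’ gradient
  • The Batch Normalizing Transform
  • The ADAM learning rule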

A few helpful ingredients

A few elements of our experiments, although not absolutely necessary, significantly improve the accuracy of BinaryNets, as indicated in Algorithm 1:

  • Batch Normalization (BN)
  • The ADAM learning rule
  • Lastly, scaling the weights’ learning rates with the weights’ initialization coefficients
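
To make the last two ingredients concrete, here is a sketch of one standard ADAM update for a single scalar weight, with a per-layer learning-rate multiplier. The `lr_scale` parameter is hypothetical. How it would be derived from the initialization coefficients is not specified in these notes, so it is simply passed in as a number.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, lr_scale=1.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a scalar weight w with gradient g at step t >= 1.

    lr_scale is a hypothetical per-layer multiplier on the learning rate.
    """
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * lr_scale * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w1, _, _ = adam_step(0.0, 0.5, 0.0, 0.0, t=1)
w2, _, _ = adam_step(0.0, 0.5, 0.0, 0.0, t=1, lr_scale=2.0)
print(w1, w2)   # roughly -0.001 and -0.002
```

On the first step the update size is close to `lr * lr_scale` regardless of the gradient's magnitude, which is why a per-layer scale translates directly into a per-layer step size.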

2. Benchmark results

(I have not read the rest.)

posted @ 2018-02-07 18:16  Osler