Notes on Compression of Neural Machine Translation Models via Pruning

The Problems of NMT Models

[Figure: an encoder-decoder NMT example. Source language input: "I am a student"; target language input: "- Je suis étudiant"; target language output: "Je suis étudiant -".]
  1. Over-parameterization
  2. Long running time
  3. Overfitting
  4. Large storage size

The Redundancies of NMT Models

Most important: higher layers; attention and softmax weights.

Most redundant: lower layers; embedding weights.

Traditional Solutions

Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)

Recent Approaches

Magnitude-based pruning with iterative retraining has yielded strong results for Convolutional Neural Networks (CNNs) on visual tasks.
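A minimal sketch of this idea in NumPy, assuming a placeholder `train` step (the names here are illustrative, not the paper's code):

```python
import numpy as np

def magnitude_prune(W, fraction):
    """Zero out the `fraction` of entries of W with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

# Iterative prune-retrain loop: prune a little, retrain, prune more.
# `train` is a stand-in for one's own training step that keeps the
# masked (pruned) entries fixed at zero.
#
# for fraction in (0.2, 0.4, 0.6):
#     W, mask = magnitude_prune(W, fraction)
#     W = train(W, mask)
```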

Other neuron-pruning approaches use sparsity-inducing regularizers, or "wire together" pairs of neurons with similar input weights.

These approaches are much more constrained than weight-pruning schemes; they necessitate finding entire zero rows of weight matrices, or near-identical pairs of rows, in order to prune a single neuron.

Weight-pruning approaches

Weight-pruning approaches allow weights to be pruned freely and independently of one another.

Many other compression techniques for neural networks exist, including:

  1. approaches based on low-rank approximations for weight matrices (sketched below);
  2. weight sharing via hash functions.
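As an illustration of the low-rank idea in item 1, here is a generic truncated-SVD factorization of a weight matrix (a standard technique, not any specific paper's method):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B, with A (m x rank) and B (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(512, 512)
A, B = low_rank_factorize(W, rank=64)
# Storage drops from 512*512 to 2*512*64 parameters, and W ≈ A @ B.
```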

Understanding NMT Weights

Weight Subgroups in LSTM

Details of the LSTM:

\[\left(\begin{array}{c} {i} \\ {f} \\ {o} \\ {\hat{h}} \end{array}\right)=\left(\begin{array}{c} {\operatorname{sigm}} \\ {\operatorname{sigm}} \\ {\operatorname{sigm}} \\ {\tanh} \end{array}\right) T_{4 n, 2 n}\left(\begin{array}{c} {h_{t}^{l-1}} \\ {h_{t-1}^{l}} \end{array}\right) \]

We obtain \(\left(h_{t}^{l}, c_{t}^{l}\right)\) from the LSTM inputs \(\left(h_{t-1}^{l}, c_{t-1}^{l}\right)\) and \(h_{t}^{l-1}\):

\[\begin{array}{l} {c_{t}^{l}=f \circ c_{t-1}^{l}+i \circ \hat{h}} \\ {h_{t}^{l}=o \circ \tanh \left(c_{t}^{l}\right)} \end{array} \]

\(T_{4 n, 2 n}\) is the \(4n \times 2n\) weight matrix that holds all of the layer's parameters.
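A NumPy sketch of the equations above, treating \(T_{4n,2n}\) as one dense matrix (bias terms omitted, matching the formula; this is an illustration, not the paper's implementation):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(T, h_below, h_prev, c_prev):
    """One LSTM step; T is the 4n x 2n parameter matrix from the formula."""
    n = h_prev.shape[0]
    z = T @ np.concatenate([h_below, h_prev])                 # shape (4n,)
    i, f, o = sigm(z[:n]), sigm(z[n:2*n]), sigm(z[2*n:3*n])   # gates
    h_hat = np.tanh(z[3*n:])                                  # candidate
    c = f * c_prev + i * h_hat        # c_t^l = f ∘ c_{t-1}^l + i ∘ ĥ
    h = o * np.tanh(c)                # h_t^l = o ∘ tanh(c_t^l)
    return h, c

n = 4
T = np.random.randn(4 * n, 2 * n)
h, c = lstm_step(T, np.zeros(n), np.zeros(n), np.zeros(n))
```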

[Figure: the NMT architecture and its weight classes, for the same "I am a student" → "Je suis étudiant" example. From bottom to top: one-hot vectors (length V) → word embeddings (length n) → hidden layer 1 (length n) → hidden layer 2 (length n) → attention hidden layer (length n) → scores (length V) → one-hot output vectors (length V). Decoder states are initialized to zero, and a context vector (one for each target word, length n) feeds the attention hidden layer.]

Pruning Schemes

Suppose we wish to prune \(x\%\) of the total parameters in the model. How do we distribute the pruning over the different weight classes? Two schemes are considered (a sketch contrasting them follows the list):

  1. Class-blind: Take all parameters, sort them by magnitude and prune the \(x \%\) with smallest magnitude, regardless of weight class.
  2. Class-uniform: Within each class, sort the weights by magnitude and prune the \(x \%\) with smallest magnitude.
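A minimal sketch contrasting the two schemes on a dictionary of NumPy weight matrices (the class names are illustrative; in the paper they would be the NMT weight groups shown in the figure above):

```python
import numpy as np

def class_blind_prune(classes, x):
    """Sort ALL parameters together; prune the smallest fraction x globally."""
    all_mags = np.concatenate([np.abs(W).ravel() for W in classes.values()])
    threshold = np.quantile(all_mags, x)
    return {name: W * (np.abs(W) >= threshold) for name, W in classes.items()}

def class_uniform_prune(classes, x):
    """Prune the smallest fraction x WITHIN each class separately."""
    return {name: W * (np.abs(W) >= np.quantile(np.abs(W), x))
            for name, W in classes.items()}

classes = {"embeddings": np.random.randn(100, 8),
           "softmax": np.random.randn(8, 100)}
blind = class_blind_prune(classes, x=0.5)
uniform = class_uniform_prune(classes, x=0.5)
```

Under class-blind pruning, a class whose weights are uniformly small (e.g. embeddings) can lose far more than \(x\%\) of its entries, while a more important class keeps more of its weights.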

With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, attention, and softmax weights. It seems that higher layers are more important than lower layers, and that attention and softmax weights are crucial.
