Notes on Compression of Neural Machine Translation Models via Pruning
Problems of NMT Models
- Over-parameterization
- Long running time
- Overfitting
- Large storage size
Redundancies in NMT Models
Most important (least redundant): higher layers; attention and softmax weights.
Most redundant: lower layers; embedding weights.
Traditional Solutions
Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)
Recent Approaches
Magnitude-based pruning with iterative retraining has yielded strong results for convolutional neural networks (CNNs) on visual tasks.
Other work prunes entire neurons, using sparsity-inducing regularizers or by "wiring together" pairs of neurons with similar input weights.
These approaches are much more constrained than weight-pruning schemes; they necessitate finding entire zero rows of weight matrices, or near-identical pairs of rows, in order to prune a single neuron.
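A minimal numpy sketch of why neuron pruning is so constrained; the toy two-layer network, its shapes, and the variable names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy fully connected network: x (4-dim) -> hidden (5 units) -> output (3 units).
W_in = np.random.randn(5, 4)   # each row feeds one hidden neuron
W_out = np.random.randn(3, 5)  # each column reads one hidden neuron

# Pruning hidden neuron j requires removing its entire incoming row and
# outgoing column, so that row/column must be (near-)zero or duplicated
# elsewhere before the neuron can be dropped.
j = 2
W_in_pruned = np.delete(W_in, j, axis=0)    # drop row j  -> shape (4, 4)
W_out_pruned = np.delete(W_out, j, axis=1)  # drop column j -> shape (3, 4)
```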
Weight-Pruning Approaches
Weight-pruning approaches allow weights to be pruned freely and independently of each other.
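A hedged sketch of magnitude-based weight pruning with a mask, in the spirit of the iterative prune-and-retrain scheme mentioned above; the function name, the 80% pruning fraction, and the schematic retraining loop are assumptions for illustration:

```python
import numpy as np

def magnitude_prune(W, fraction):
    """Zero out the `fraction` of entries in W with smallest |value|."""
    threshold = np.percentile(np.abs(W), fraction * 100)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(256, 512)
W_pruned, mask = magnitude_prune(W, fraction=0.8)  # prune 80% of the weights

# Iterative retraining (schematic only): after each pruning step, keep
# training but hold pruned weights at zero by re-applying the mask:
# for step in range(num_steps):
#     W -= learning_rate * grad(W)
#     W *= mask
```

Because each entry is pruned independently, no structure (whole rows or neuron pairs) has to emerge before compression is possible.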
Many other compression techniques exist for neural networks:
- approaches based on low-rank approximations of weight matrices;
- weight sharing via hash functions.
Understanding NMT Weights
Weight Subgroups in LSTM
Details of the LSTM:
The LSTM at layer \(l\) computes \(\left(h_{t}^{l}, c_{t}^{l}\right)\) from its inputs \(h_{t}^{l-1}\) (from the layer below) and \(\left(h_{t-1}^{l}, c_{t-1}^{l}\right)\) (from the previous timestep):
\[
\begin{pmatrix} i \\ f \\ o \\ \hat{h} \end{pmatrix}
=
\begin{pmatrix} \operatorname{sigm} \\ \operatorname{sigm} \\ \operatorname{sigm} \\ \tanh \end{pmatrix}
T_{4n, 2n}
\begin{pmatrix} h_{t}^{l-1} \\ h_{t-1}^{l} \end{pmatrix},
\qquad
c_{t}^{l} = f \odot c_{t-1}^{l} + i \odot \hat{h},
\qquad
h_{t}^{l} = o \odot \tanh\left(c_{t}^{l}\right)
\]
\(T_{4n, 2n}\) is the matrix that holds all the parameters of the layer: it maps the concatenated \(2n\)-dimensional input \(\left[h_{t}^{l-1}; h_{t-1}^{l}\right]\) to the \(4n\) pre-activations of the input, forget, and output gates and the candidate update \(\hat{h}\).
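A minimal numpy sketch of how \(T_{4n,2n}\) splits into the four gate subgroups; the hidden size \(n\) and random inputs are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 4                               # hidden size (illustrative)
T = np.random.randn(4 * n, 2 * n)   # all parameters of one LSTM layer

h_below = np.random.randn(n)   # h_t^{l-1}: output of the layer below
h_prev = np.random.randn(n)    # h_{t-1}^{l}: previous hidden state
c_prev = np.random.randn(n)    # c_{t-1}^{l}: previous cell state

# One matrix-vector product yields all 4n gate pre-activations.
z = T @ np.concatenate([h_below, h_prev])
i, f, o, h_hat = np.split(z, 4)      # the four weight subgroups of T

c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(h_hat)
h_t = sigmoid(o) * np.tanh(c_t)
```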
Pruning Schemes
Suppose we wish to prune \(x\%\) of the total parameters in the model. How do we distribute the pruning over the different weight classes?
- Class-blind: Take all parameters, sort them by magnitude and prune the \(x \%\) with smallest magnitude, regardless of weight class.
- Class-uniform: Within each class, sort the weights by magnitude and prune the \(x \%\) with smallest magnitude.
With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, and the attention and softmax weights. It seems that higher layers are more important than lower layers, and that attention and softmax weights are crucial.
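A hedged sketch of the two schemes, assuming the model's parameters are available as a dict keyed by weight class; the class names, sizes, and \(x = 0.5\) are illustrative assumptions:

```python
import numpy as np

def class_blind_prune(weights, x):
    """Pool all parameters and zero the x fraction with smallest magnitude."""
    all_vals = np.concatenate([np.abs(W).ravel() for W in weights.values()])
    threshold = np.percentile(all_vals, x * 100)  # one global threshold
    return {name: W * (np.abs(W) >= threshold) for name, W in weights.items()}

def class_uniform_prune(weights, x):
    """Within each class, zero the x fraction with smallest magnitude."""
    pruned = {}
    for name, W in weights.items():
        threshold = np.percentile(np.abs(W), x * 100)  # per-class threshold
        pruned[name] = W * (np.abs(W) >= threshold)
    return pruned

# Illustrative weight classes and sizes:
weights = {
    "source_embedding": np.random.randn(1000, 64),
    "target_layer_4":   np.random.randn(256, 128),
    "attention":        np.random.randn(64, 128),
    "softmax":          np.random.randn(1000, 64),
}
pruned_blind = class_blind_prune(weights, x=0.5)
pruned_uniform = class_uniform_prune(weights, x=0.5)
```

The only difference is where the magnitude threshold is computed: globally over all classes (class-blind) or separately inside each class (class-uniform), which is why class-uniform pruning hits the less redundant classes harder.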