Notes on Compression of Neural Machine Translation Models via Pruning

The Problems of NMT Models

[Figure: an encoder-decoder NMT example. Source language input: "I am a student"; target language input: "- Je suis étudiant"; target language output: "Je suis étudiant -".]
  1. Over-parameterization
  2. Long running time
  3. Overfitting
  4. Large storage size

The Redundancies of NMT Models

Most important: higher layers; attention and softmax weights.

Most redundant: lower layers; embedding weights.

Traditional Solutions

Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS)

Recent Approaches

Magnitude-based pruning with iterative retraining has yielded strong results for Convolutional Neural Networks (CNNs) on visual tasks.
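A minimal sketch of this idea in NumPy, assuming a placeholder `train` step (the names here are illustrative, not the paper's code):

```python
import numpy as np

def magnitude_prune(W, fraction):
    """Zero out the `fraction` of entries of W with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), fraction)
    mask = np.abs(W) >= threshold
    return W * mask, mask

# Iterative prune-retrain loop: prune a little, retrain, prune more.
# `train` is a stand-in for one's own training step that keeps the
# masked (pruned) entries fixed at zero.
#
# for fraction in (0.2, 0.4, 0.6):
#     W, mask = magnitude_prune(W, fraction)
#     W = train(W, mask)
```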

Other neuron-pruning approaches use sparsity-inducing regularizers, or "wire together" pairs of neurons with similar input weights.

These approaches are much more constrained than weight-pruning schemes; they necessitate finding entire zero rows of weight matrices, or near-identical pairs of rows, in order to prune a single neuron.

Weight-pruning approaches

Weight-pruning approaches allow weights to be pruned freely and independently of one another.

Many other compression techniques for neural networks exist, including:

  1. approaches based on low-rank approximations for weight matrices (sketched below);
  2. weight sharing via hash functions.
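As an illustration of the low-rank idea in item 1, here is a generic truncated-SVD factorization of a weight matrix (a standard technique, not any specific paper's method):

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A @ B, with A (m x rank) and B (rank x n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

W = np.random.randn(512, 512)
A, B = low_rank_factorize(W, rank=64)
# Storage drops from 512*512 to 2*512*64 parameters, and W ≈ A @ B.
```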

Understanding NMT Weights

Weight Subgroups in LSTM

Details of the LSTM:

\[\left(\begin{array}{c} {i} \\ {f} \\ {o} \\ {\hat{h}} \end{array}\right)=\left(\begin{array}{c} {\operatorname{sigm}} \\ {\operatorname{sigm}} \\ {\operatorname{sigm}} \\ {\tanh} \end{array}\right) T_{4 n, 2 n}\left(\begin{array}{c} {h_{t}^{l-1}} \\ {h_{t-1}^{l}} \end{array}\right) \]

We obtain \(\left(h_{t}^{l}, c_{t}^{l}\right)\) from the LSTM inputs \(\left(h_{t-1}^{l}, c_{t-1}^{l}\right)\) and \(h_{t}^{l-1}\):

\[\begin{array}{l} {c_{t}^{l}=f \circ c_{t-1}^{l}+i \circ \hat{h}} \\ {h_{t}^{l}=o \circ \tanh \left(c_{t}^{l}\right)} \end{array} \]

\(T_{4 n, 2 n}\) is the \(4n \times 2n\) weight matrix that holds all of the layer's parameters.
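A NumPy sketch of the equations above, treating \(T_{4n,2n}\) as one dense matrix (bias terms omitted, matching the formula; this is an illustration, not the paper's implementation):

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(T, h_below, h_prev, c_prev):
    """One LSTM step; T is the 4n x 2n parameter matrix from the formula."""
    n = h_prev.shape[0]
    z = T @ np.concatenate([h_below, h_prev])                 # shape (4n,)
    i, f, o = sigm(z[:n]), sigm(z[n:2*n]), sigm(z[2*n:3*n])   # gates
    h_hat = np.tanh(z[3*n:])                                  # candidate
    c = f * c_prev + i * h_hat        # c_t^l = f ∘ c_{t-1}^l + i ∘ ĥ
    h = o * np.tanh(c)                # h_t^l = o ∘ tanh(c_t^l)
    return h, c

n = 4
T = np.random.randn(4 * n, 2 * n)
h, c = lstm_step(T, np.zeros(n), np.zeros(n), np.zeros(n))
```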

[Figure: the NMT architecture and its weight classes, for the same "I am a student" → "Je suis étudiant" example. From bottom to top: one-hot vectors (length V) → word embeddings (length n) → hidden layer 1 (length n) → hidden layer 2 (length n) → attention hidden layer (length n) → scores (length V) → one-hot output vectors (length V). Decoder states are initialized to zero, and a context vector (one for each target word, length n) feeds the attention hidden layer.]

Pruning Schemes

Suppose we wish to prune \(x\%\) of the total parameters in the model. How do we distribute the pruning over the different weight classes? Two schemes are considered (a sketch contrasting them follows the list):

  1. Class-blind: Take all parameters, sort them by magnitude and prune the \(x \%\) with smallest magnitude, regardless of weight class.
  2. Class-uniform: Within each class, sort the weights by magnitude and prune the \(x \%\) with smallest magnitude.
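A minimal sketch contrasting the two schemes on a dictionary of NumPy weight matrices (the class names are illustrative; in the paper they would be the NMT weight groups shown in the figure above):

```python
import numpy as np

def class_blind_prune(classes, x):
    """Sort ALL parameters together; prune the smallest fraction x globally."""
    all_mags = np.concatenate([np.abs(W).ravel() for W in classes.values()])
    threshold = np.quantile(all_mags, x)
    return {name: W * (np.abs(W) >= threshold) for name, W in classes.items()}

def class_uniform_prune(classes, x):
    """Prune the smallest fraction x WITHIN each class separately."""
    return {name: W * (np.abs(W) >= np.quantile(np.abs(W), x))
            for name, W in classes.items()}

classes = {"embeddings": np.random.randn(100, 8),
           "softmax": np.random.randn(8, 100)}
blind = class_blind_prune(classes, x=0.5)
uniform = class_uniform_prune(classes, x=0.5)
```

Under class-blind pruning, a class whose weights are uniformly small (e.g. embeddings) can lose far more than \(x\%\) of its entries, while a more important class keeps more of its weights.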

With class-uniform pruning, the overall performance loss is caused disproportionately by a few classes: target layer 4, attention, and softmax weights. It seems that higher layers are more important than lower layers, and that attention and softmax weights are crucial.
