Weight pruning is another line of work for compressing
BERT-style models. In this post, I survey BERT-related
weight pruning papers.
In general, weight pruning means identifying and removing
redundant or less essential weights and/or components.
Related work can be divided into three categories:
Showing that the weights are indeed redundant and prunable.
Element-wise pruning.
Structured pruning.
Just like data quantization, one characteristic of this line
of work is that, after pruning, the weights usually become
very sparse. Sparse weights can greatly reduce the model
size. However, sparsity alone does not speed up inference on
standard dense hardware. As a result, special hardware is
needed to achieve a decent computation speed-up after
element-wise pruning.
Prove Weights are Redundant
This line of work empirically shows that the weights of large
transformer-based models are indeed redundant. A significant
fraction of the weights can be removed without hurting the
final performance.
Kovaleva et al. conduct \(6\) different experiments on the
attention mechanism of the BERT model. The authors conclude
that BERT is highly over-parameterized, showing that the same
self-attention patterns are repeated across different heads.
Moreover, in their disabling experiments, they find that
disabling single or multiple heads is not detrimental to
model performance and, in some cases, even improves it. Their
experiments also show that the last two layers encode the
task-specific features and contribute to the performance
gain.
Similarly, Michel et al. conduct extensive experiments to
analyze the importance of attention heads in
transformer-based models. They perform their experiments on
WMT and BERT. In their first setting, they remove one head at
a time to measure its impact. They find that the majority of
attention heads can be removed without deviating too much
from the original score. They also find that, in some cases,
removing an attention head can increase performance, which is
consistent with the conclusions of Kovaleva et al. To answer
the question of whether more than one head is needed, they
remove all attention heads but one within a single layer.
Surprisingly, they find that one head is indeed sufficient
at test time. Next, they also try to prune the attention
heads iteratively. Experiments show that removing \(20\%\) to
\(40\%\) of heads does not incur any noticeable negative
impact.
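The head-removal experiments above can be pictured as multiplying each head's output by a 0/1 mask before the heads are combined. The following is only a minimal sketch of that idea, not the authors' code. It assumes each head's output has already been projected to the model dimension, so that combining heads is a plain sum. The names `combine_heads`, `head_outputs`, and `head_mask` are hypothetical.

```python
def combine_heads(head_outputs, head_mask):
    """Sum per-head output vectors, zeroing heads whose mask entry is 0."""
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for output, keep in zip(head_outputs, head_mask):
        for i in range(dim):
            combined[i] += keep * output[i]
    return combined


# Three toy heads, each producing a 2-dimensional output for one token.
head_outputs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]

full = combine_heads(head_outputs, [1, 1, 1])     # all heads active
ablated = combine_heads(head_outputs, [1, 0, 1])  # second head removed
```

Evaluating the model once per mask, with exactly one zero entry, gives the "remove one head at a time" setting. A mask with a single one per layer gives the "all heads but one" setting.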
Consistent results are also reported by Voita et al., who
take a further step and show that some “important heads” can
be characterized by three functions: (i) positional, (ii)
syntactic, and (iii) rare words. They also propose a new
pruning method that applies a regularized objective during
fine-tuning. They observe that these specialized heads are
often the last to be pruned, which confirms their
importance.
Element-wise Pruning
After it was shown empirically that the weights are indeed
redundant, many pruning methods were proposed. One line of
work focuses on identifying and removing individual weights
of the given model. The importance of each weight can be
determined by its absolute value, its gradients, or other
measurements defined by the designers.
Gordon et al. apply a simple magnitude weight pruning
strategy to BERT. Using this strategy, they prune the BERT
model from \(0\%\) to \(90\%\) sparsity in increments of
\(10\%\). Their experiments show that a simple pruning
strategy like magnitude weight pruning can remove
\(30\%-40\%\) of the weights without hurting the pre-training
loss or inference on any downstream task. However, as the
sparsity keeps increasing, the pre-training loss starts to
increase and performance starts to degrade.
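Magnitude weight pruning is simple enough to sketch in a few lines. The version below is only an illustration on a nested-list "matrix", not the setup used by Gordon et al. The function name `magnitude_prune` and the toy weights are assumptions. Weights tied with the threshold are all pruned, so the achieved sparsity can slightly exceed the target.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    num_pruned = int(len(magnitudes) * sparsity)
    if num_pruned == 0:
        return [row[:] for row in weights]
    # Every weight whose magnitude is at or below this value is removed.
    threshold = magnitudes[num_pruned - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


weights = [[0.1, -0.8, 0.05],
           [0.6, -0.02, 0.3]]
pruned = magnitude_prune(weights, sparsity=0.5)
```

Sweeping `sparsity` over 0.0, 0.1, ..., 0.9 would mimic the kind of sparsity schedule described above, with the model re-evaluated at each level.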
Similarly, Sanh et al. prune individual weights, but with a
strategy called movement pruning rather than plain magnitude
weight pruning. The basic idea is to learn a scoring matrix
during training that indicates the importance of each weight.
Intuitively, magnitude pruning selects the weights that are
far from zero. In contrast, movement pruning selects the
weights that are moving away from zero during the training
process. The experiments show that, using this method, BERT
can achieve \(95\%\) of the original BERT performance with
only \(5\%\) of the encoder’s weights on natural language
inference (MNLI) and question answering (SQuAD v1.1).
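The contrast between magnitude and movement pruning can be made concrete with a toy example. This is a simplified sketch, not the authors' implementation. It assumes plain gradient descent, under which a weight moves away from zero exactly when its gradient and its value have opposite signs, and it accumulates the score as `-lr * grad * weight`. The helper names and the numbers are hypothetical.

```python
def update_scores(scores, weights, grads, lr=1.0):
    """Accumulate -lr * grad * weight: positive when a weight moves away from zero."""
    return [s - lr * g * w for s, w, g in zip(scores, weights, grads)]


def top_k_mask(scores, k):
    """Return a 0/1 mask that keeps the k entries with the highest scores."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [1 if i in keep else 0 for i in range(len(scores))]


weights = [0.9, 0.1, -0.5, -0.05]
grads = [0.2, -0.3, 0.4, 0.1]  # one toy gradient step

scores = update_scores([0.0] * len(weights), weights, grads)
movement_mask = top_k_mask(scores, k=2)
magnitude_mask = top_k_mask([abs(w) for w in weights], k=2)
```

Here the largest weight (0.9) is being pushed toward zero, so movement pruning drops it even though magnitude pruning would keep it. The small weight 0.1 is growing, so movement pruning keeps it.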
Unlike Gordon et al. and Sanh et al., who prune the weights
based on a score computed for each weight, Guo et al. use a
regularizer to constrain the weights. They propose a pruning
method called Reweighted Proximal Pruning (RPP), in which the
\(L_1\) regularizer is iteratively reweighted. In their
experiments, they show that RPP can achieve \(59.3\%\) weight
sparsity without inducing performance loss on both
pre-training and fine-tuning tasks.
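To give a feel for what "reweighting an \(L_1\) regularizer" means, here is a generic sketch of one reweighted proximal step. It is not the RPP algorithm itself. It assumes the common reweighting rule in which each weight's penalty is `1 / (|w| + eps)`, followed by soft-thresholding, the standard proximal operator of a weighted \(L_1\) penalty. The names `lam` and `eps` and their values are assumptions chosen for illustration.

```python
def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink w toward zero by t, clipping at zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def reweighted_prox_step(weights, lam, eps=0.1):
    """Penalize small weights more heavily, then apply soft-thresholding."""
    penalties = [1.0 / (abs(w) + eps) for w in weights]
    return [soft_threshold(w, lam * p) for w, p in zip(weights, penalties)]


weights = [1.0, -0.5, 0.05, -0.01]
result = reweighted_prox_step(weights, lam=0.02)
```

Large weights are barely shrunk, while the two small weights receive a large penalty and are set exactly to zero, which is how sparsity emerges from the regularizer rather than from an explicit threshold on the weights.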
Structured Pruning
Unlike element-wise pruning, structured pruning compresses
models by identifying and removing component modules, such
as attention heads or embedding layers.
Self-attention in transformers is used to let the model find
the most related part of the input. However,
Raganato et al. show that replacing these learnable
attention heads with simple, fixed, non-learnable
attentive patterns does not impact the translation quality.
These attentive patterns are solely based on position and do
not require any external knowledge.
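A fixed, position-based pattern can be written down without any parameters. The sketch below builds one such pattern, in which every position attends to the previous token, and applies it to toy value vectors. It only illustrates the idea. The specific pattern, the choice to let the first position attend to itself, and the function names are assumptions, not the authors' exact definitions.

```python
def previous_token_attention(seq_len):
    """Row i puts all attention mass on position i - 1 (position 0 attends to itself)."""
    attn = [[0.0] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        attn[i][max(i - 1, 0)] = 1.0
    return attn


def apply_attention(attn, values):
    """Weighted sum of the value vectors for each query position."""
    dim = len(values[0])
    return [
        [sum(attn[i][j] * values[j][d] for j in range(len(values))) for d in range(dim)]
        for i in range(len(attn))
    ]


values = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
attn = previous_token_attention(len(values))
output = apply_attention(attn, values)
```

Because the attention matrix depends only on positions, no query or key projections need to be learned for such a head.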
Fan et al. propose a different structured pruning method
that adopts the idea of dropout. When training neural
networks, dropout is commonly applied to achieve better
performance. Fan et al. follow the same idea, but drop
entire layers instead of individual units. They call this
method LayerDrop. At training time, they sample a set of
layers to drop in each iteration; at test time, one can
choose how many layers to use depending on the task and the
available computation resources.
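The training-time and test-time behaviour of LayerDrop can be sketched with layers modelled as plain functions. This is a minimal illustration under stated assumptions, not the authors' implementation. A dropped layer is treated as the identity, which is reasonable for residual layers. The "keep every other layer" rule is just one possible test-time selection. All names are hypothetical.

```python
import random


def forward(x, layers, drop_prob=0.0, rng=random):
    """Apply layers in order, skipping each one with probability drop_prob."""
    for layer in layers:
        if rng.random() < drop_prob:
            continue  # dropped layer acts as the identity
        x = layer(x)
    return x


def keep_every_other(layers):
    """One simple test-time choice: keep layers 0, 2, 4, ..."""
    return layers[::2]


# Six toy "layers" that each add 1 to their input.
layers = [lambda x: x + 1 for _ in range(6)]

full_output = forward(0, layers)                       # inference with all layers
shallow_output = forward(0, keep_every_other(layers))  # inference with half the layers
train_output = forward(0, layers, drop_prob=0.5)       # one random training pass
```

Because the network sees many random sub-networks during training, a shallower sub-network can be extracted at test time without retraining.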
Instead of focusing on attention heads or entire transformer
layers, Wang et al. prune the model by factorizing the weight
matrices. The idea is to factorize each matrix into two
smaller matrices and insert a diagonal mask matrix between
them. They prune the diagonal mask matrix via regularization
and use an augmented Lagrangian approach to control the final
sparsity. One advantage of this method is that it is generic
and can be applied to any matrix multiplication.
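The reason a diagonal mask yields structured pruning is that zeroing one diagonal entry removes a whole column of the first factor and a whole row of the second. The sketch below checks this on a toy example. It assumes the factorization has the form P · diag(z) · Q, and the names `p`, `z`, `q` and the helper functions are hypothetical. How the mask is learned is outside its scope.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def masked_product(p, z, q):
    """Compute P @ diag(z) @ Q by scaling column k of P with z[k]."""
    scaled = [[p[i][k] * z[k] for k in range(len(z))] for i in range(len(p))]
    return matmul(scaled, q)


def drop_pruned_ranks(p, z, q):
    """Physically remove the columns of P and rows of Q whose mask entry is zero."""
    keep = [k for k, zk in enumerate(z) if zk != 0]
    p_small = [[row[k] * z[k] for k in keep] for row in p]
    q_small = [q[k] for k in keep]
    return p_small, q_small


p = [[1, 2, 3], [4, 5, 6]]    # 2 x 3 factor
q = [[1, 0], [0, 1], [2, 2]]  # 3 x 2 factor
z = [1, 0, 1]                 # diagonal mask with the middle entry pruned

p_small, q_small = drop_pruned_ranks(p, z, q)
```

The smaller factors give the same product as the masked one while storing fewer numbers and running as ordinary dense matrix multiplications, so no sparse kernels are needed.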