Weight pruning is another line of work for compressing BERT-style models. In this post, I survey weight pruning papers related to BERT.

In general, weight pruning means identifying and removing redundant or less essential weights and/or components. Related work can be divided into three categories:

  1. Showing that the weights are indeed redundant and prunable.
  2. Element-wise pruning.
  3. Structured pruning.

As with data quantization, one characteristic of this line of work is that the weights usually become very sparse after pruning. Sparse weights can greatly reduce the model size. However, sparsity by itself does not reduce inference time. As a result, special hardware is needed to achieve a decent computation speed-up after element-wise pruning.
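To make the storage-versus-speed point concrete, here is a rough sketch. It assumes a hypothetical 768 x 768 float32 weight matrix pruned to about 90% sparsity and stored in a CSR-like layout (values, int32 column indices, row pointers); the helper names `dense_bytes` and `csr_bytes` are made up for illustration.

```python
import numpy as np

def dense_bytes(shape, itemsize=4):
    """Bytes needed to store every entry of a dense matrix."""
    rows, cols = shape
    return rows * cols * itemsize

def csr_bytes(weights, itemsize=4, index_size=4):
    """Bytes for a CSR-like layout: non-zero values, column indices, row pointers."""
    nnz = int(np.count_nonzero(weights))
    rows = weights.shape[0]
    return nnz * itemsize + nnz * index_size + (rows + 1) * index_size

rng = np.random.default_rng(0)
w = rng.standard_normal((768, 768)).astype(np.float32)

# Zero out roughly the 90% of entries with the smallest magnitude.
threshold = np.quantile(np.abs(w), 0.9)
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0).astype(np.float32)

print("dense storage :", dense_bytes(w_pruned.shape), "bytes")
print("sparse storage:", csr_bytes(w_pruned), "bytes")

# A dense kernel multiplies by the zeros as well, so the number of
# multiply-adds per input vector is the same before and after pruning.
dense_madds = w_pruned.shape[0] * w_pruned.shape[1]
```

The sparse layout is much smaller, but a standard dense matrix multiplication still touches every entry of `w_pruned`, zeros included.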

Prove Weights are Redundant

This line of work empirically shows that the weights of large transformer-based models are indeed redundant. A significant part of the weights can be removed without hurting the final performance.

Kovaleva et al. (2019) conduct \(6\) different experiments on the attention mechanism of the BERT model. The authors conclude that BERT is highly over-parameterized, showing that the same self-attention patterns are repeated across different heads. Moreover, in their disabling experiments, they find that disabling either single or multiple heads is not detrimental to model performance and, in some cases, even improves it. Their experiments also show that the last two layers encode the task-specific features and contribute to the performance gain.

Similarly, Michel et al. (2019) conduct extensive experiments to analyze the importance of attention heads in transformer-based models. They perform their experiments on WMT (Ott et al., 2018) and BERT (Devlin et al., 2019). In their first setting, they remove one head at a time to measure the impact of each head. They find that the majority of attention heads can be removed without deviating too much from the original score. They also find that, in some cases, removing an attention head can increase performance, which is consistent with the conclusions of Kovaleva et al. (2019). To answer the question of whether more than one head is needed, they remove all attention heads but one within a single layer. Surprisingly, they find that one head is indeed sufficient at test time. Next, they also try to prune the attention heads iteratively. Experiments show that removing \(20\%\) to \(40\%\) of heads does not incur any noticeable negative impact.
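One way to picture these head-removal experiments is as a 0/1 mask multiplied onto each head's output. The NumPy sketch below is a toy illustration under that assumption, not the authors' code; the function `multi_head_attention` and the tensor sizes are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, head_mask):
    """q, k, v: (num_heads, seq_len, head_dim); head_mask: (num_heads,) of 0/1."""
    head_dim = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    context = softmax(scores) @ v                  # (num_heads, seq_len, head_dim)
    context = context * head_mask[:, None, None]   # a masked head outputs zeros
    # Concatenate the heads: (seq_len, num_heads * head_dim).
    return context.transpose(1, 0, 2).reshape(q.shape[1], -1)

rng = np.random.default_rng(0)
num_heads, seq_len, head_dim = 4, 5, 8
q, k, v = (rng.standard_normal((num_heads, seq_len, head_dim)) for _ in range(3))

all_heads = multi_head_attention(q, k, v, np.ones(num_heads))

mask = np.ones(num_heads)
mask[2] = 0.0                                      # disable head 2 only
without_head_2 = multi_head_attention(q, k, v, mask)
```

Evaluating the task metric once per mask (one head off at a time, or all but one head off in a layer) reproduces the shape of the ablation study described above.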

Voita et al. (2019) report consistent results and take a further step, showing that some "important heads" can be characterized by 3 functions: (i) positional, (ii) syntactic, and (iii) rare words. They also propose a new pruning method that applies a regularized objective when fine-tuning. They observe that these specialized heads are often the last to be pruned, which confirms their importance.

Element-wise Pruning

After it was shown empirically that the weights are indeed redundant, many pruning methods have been proposed. One line of work focuses on identifying and removing individual weights of the given model. The importance of each weight can be determined by its absolute value, its gradient, or another measurement defined by the designers.

Gordon et al. (2020) apply a simple magnitude weight pruning strategy (Han et al., 2015) to BERT. Using this strategy, they prune the BERT model from \(0\%\) to \(90\%\) sparsity in increments of \(10\%\). Their experiments show that a simple strategy like magnitude weight pruning can remove \(30\%-40\%\) of the weights without hurting the pre-training loss or inference on any downstream task. However, as the sparsity keeps increasing, the pre-training loss starts increasing and performance starts degrading.
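Magnitude weight pruning itself fits in a few lines. The sketch below prunes a single random matrix at each sparsity level from 0% to 90%; pruning one matrix at a time and the helper name `magnitude_prune` are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))        # number of weights to remove
    mask = np.ones(weights.size, dtype=bool)
    if k > 0:
        flat = np.abs(weights).ravel()
        prune_idx = np.argpartition(flat, k - 1)[:k]   # k smallest magnitudes
        mask[prune_idx] = False
    mask = mask.reshape(weights.shape)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))

for step in range(10):                             # 0%, 10%, ..., 90%
    sparsity = step / 10
    pruned, mask = magnitude_prune(w, sparsity)
    print(f"sparsity {sparsity:.1f}: {int(mask.sum())} weights kept")
```

In practice the mask is usually kept fixed, or recomputed periodically, while training continues, so that the remaining weights can compensate for the removed ones.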

Similarly, Sanh et al. (2020) use a modified magnitude weight pruning strategy, movement pruning, to prune the weights. The basic idea is to learn a scoring matrix during training that indicates the importance of each weight. Intuitively, magnitude pruning selects the weights that are far from zero. In contrast, movement pruning selects the weights that are moving away from zero during the training process. The experiments show that, with movement pruning, BERT can achieve \(95\%\) of the original BERT performance with only \(5\%\) of the encoder's weights on natural language inference (MNLI) (Williams et al., 2018) and question answering (SQuAD v1.1) (Rajpurkar et al., 2016).
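The difference between the two selection criteria shows up even on a two-weight toy problem. The sketch below assumes a made-up quadratic loss and accumulates a score of the form "minus gradient times weight", which is positive when a weight is moving away from zero; it is a simplified illustration, not the training procedure from the paper.

```python
import numpy as np

def top_k_mask(scores, k):
    """Boolean mask that keeps the k entries with the highest score."""
    keep = np.argsort(scores.ravel())[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(scores.shape)

# Toy loss 0.5 * ||w - target||^2, so the gradient is simply w - target.
w = np.array([2.0, 0.1])         # one large weight, one small weight
target = np.array([0.0, 1.5])    # the first shrinks toward zero, the second grows
scores = np.zeros_like(w)
lr = 0.1

for _ in range(3):
    grad = w - target
    scores -= lr * grad * w      # -grad * w is positive when |w| is increasing
    w -= lr * grad

magnitude_mask = top_k_mask(np.abs(w), k=1)   # keeps the weight far from zero
movement_mask = top_k_mask(scores, k=1)       # keeps the weight moving away from zero
print(magnitude_mask, movement_mask)
```

After three steps the first weight is still the larger one, so magnitude pruning keeps it, while the movement score favors the second weight because it is growing.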

Unlike Gordon et al. (2020) and Sanh et al. (2020), who prune the weights based on a value computed for each weight, Guo et al. (2019) use a regularizer to constrain the weights. They propose a pruning method called Reweighted Proximal Pruning (RPP), in which the authors iteratively reweight the \(L_1\) regularizer. In their experiments, they show that RPP can achieve \(59.3\%\) weight sparsity without any performance loss on both pre-training and fine-tuning tasks.
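The general idea of a reweighted \(L_1\) penalty solved with proximal steps can be sketched as follows. This is a generic reweighted-\(L_1\) scheme on a toy quadratic loss, with hypothetical hyperparameters (`lam`, `lr`, `eps`) and a hypothetical function name; the optimizer and schedule used in the paper may differ.

```python
import numpy as np

def soft_threshold(x, threshold):
    """Proximal operator of a (per-weight) L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)

def reweighted_proximal_sketch(w, grad_fn, lam=0.05, lr=0.5, eps=0.1,
                               outer_steps=3, inner_steps=10):
    w = w.copy()
    for _ in range(outer_steps):
        # Reweighting: small weights receive a larger L1 penalty.
        alpha = 1.0 / (np.abs(w) + eps)
        for _ in range(inner_steps):
            # Gradient step on the loss, then proximal step on the penalty.
            w = soft_threshold(w - lr * grad_fn(w), lr * lam * alpha)
    return w

# Toy loss 0.5 * ||w - target||^2 with two large and two tiny target values.
target = np.array([1.0, 0.05, -0.8, 0.02])
w0 = np.array([0.9, 0.1, -0.7, 0.1])
w_sparse = reweighted_proximal_sketch(w0, grad_fn=lambda w: w - target)
print(w_sparse)
```

The proximal step sets small weights to exactly zero instead of merely shrinking them, and the reweighting keeps the penalty on the large weights mild.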

Structured Pruning

Different from element-wise pruning, structured pruning compresses models by identifying and removing component modules, such as attention heads or embedding layers.

Self-attention in transformers is used to let the model find the most related part of the input. However, Raganato et al. (2020) show that replacing these learnable attention heads with a simple fixed non-learnable attentive pattern does not impact the translation quality. These attentive patterns are solely based on position and do not require any external knowledge.
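A position-only attention pattern can be written down without any learned parameters. The sketch below builds a one-hot "attend to the token at a fixed offset" matrix; the offset-based pattern, the clipping at the sequence boundary, and the name `fixed_attention` are illustrative assumptions rather than the exact patterns used in the paper.

```python
import numpy as np

def fixed_attention(seq_len, offset):
    """One-hot attention: token i attends to token i + offset (clipped to the sequence)."""
    attn = np.zeros((seq_len, seq_len))
    for i in range(seq_len):
        j = min(max(i + offset, 0), seq_len - 1)
        attn[i, j] = 1.0
    return attn

values = np.arange(12, dtype=float).reshape(4, 3)   # 4 tokens, 3-dim value vectors

previous_token = fixed_attention(4, offset=-1)
next_token = fixed_attention(4, offset=+1)

# The head output is a fixed rearrangement of the value vectors.
context = previous_token @ values
```

Because the matrix depends only on positions, no query or key projections are needed for such a head.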

Fan et al. (2019) propose a different structured pruning method that adopts the idea of dropout. When training neural networks, we usually apply dropout to achieve better performance. Fan et al. (2019) follow the same idea, but they drop entire layers instead of individual weights. They call this method LayerDrop. At training time, they sample a set of layers to be dropped in each iteration; at test time, we can choose how many layers to use depending on the task and the available computation resources.
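The training-time and test-time behavior can be sketched with toy "layers" that are plain Python callables. The function names, the drop probability, and the "keep every other layer" rule at test time are assumptions chosen for illustration.

```python
import random

def forward_train(x, layers, drop_prob, rng):
    """Training-time pass: each layer is skipped independently with probability drop_prob."""
    for layer in layers:
        if rng.random() < drop_prob:
            continue              # a dropped layer is skipped, the input passes through
        x = layer(x)
    return x

def forward_pruned(x, layers, keep_every):
    """Test-time pass that uses only every `keep_every`-th layer."""
    for index, layer in enumerate(layers):
        if index % keep_every == 0:
            x = layer(x)
    return x

# Toy layers: layer i adds i to its input.
layers = [lambda x, i=i: x + i for i in range(6)]

rng = random.Random(0)
train_out = forward_train(0, layers, drop_prob=0.2, rng=rng)
full_out = forward_pruned(0, layers, keep_every=1)    # all 6 layers
half_out = forward_pruned(0, layers, keep_every=2)    # layers 0, 2, 4 only
```

Since the network has already seen randomly missing layers during training, it can be run at a reduced depth at test time without retraining.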

Instead of focusing on attention heads or entire transformer layers, Wang et al. (2019) prune the model by factorizing the weight matrices. The idea is to factorize each matrix into two smaller matrices and insert a diagonal mask matrix between them. They prune the diagonal mask matrix via regularization and use an augmented Lagrangian approach to control the final sparsity. One advantage of this method is that it is generic: it can be applied to any matrix multiplication.
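The effect of pruning the diagonal mask can be checked numerically. In the sketch below, the matrices `P` and `Q`, the mask `z`, and all sizes are made-up illustrative values; it only shows that a zero on the diagonal lets us drop one column of `P` and one row of `Q` without changing the product.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 5, 4

P = rng.standard_normal((d_out, rank))
Q = rng.standard_normal((rank, d_in))
z = np.array([1.0, 0.0, 1.0, 0.0])     # diagonal mask; zeros mark pruned components

# Factorized weight with the diagonal mask in the middle.
W_masked = P @ np.diag(z) @ Q

# Each zero in z removes one column of P and the matching row of Q.
keep = z != 0
P_small = P[:, keep] * z[keep]
Q_small = Q[keep, :]
W_compact = P_small @ Q_small

params_before = P.size + Q.size
params_after = P_small.size + Q_small.size
print(params_before, "->", params_after)
```

Unlike element-wise sparsity, the result is a pair of smaller dense matrices, so ordinary dense kernels benefit directly.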


A list of weight pruning papers:

A comparison of the weight pruning papers:

| Paper                   | Memory (BERT: 100%) | Inference | Component             | Pretrain | Performance    |
|-------------------------|---------------------|-----------|-----------------------|----------|----------------|
| (Zafrir et al., 2019)   | N/A                 | N/A       | N/A                   | N/A      | N/A            |
| (Fan et al., 2020)      | N/A                 | N/A       | N/A                   | N/A      | N/A            |
| (Shen et al., 2020)     | N/A                 | N/A       | N/A                   | N/A      | N/A            |
| (Gordon et al., 2020)   | 30~40%              | N/A       | all weights           | True     | same           |
| (Guo et al., 2019)      | 12%~41%             | N/A       | all weights           | True     | worse          |
| (Sanh et al., 2020)     | 3%~10%              | N/A       | layers and heads      | False    | worse          |
| (Raganato et al., 2020) | N/A                 | N/A       | attention heads       | False    | similar        |
| (Fan et al., 2019)      | 25%~50%             | N/A       | transformer layers    | True     | same           |
| (Wang et al., 2019)     | 65%                 | N/A       | matrix multiplication | False    | slightly worse |


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. 2019.

Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.

Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. arXiv preprint arXiv:2004.07320, 2020.

Mitchell A Gordon, Kevin Duh, and Nicholas Andrews. Compressing BERT: studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307, 2020.

Fu-Ming Guo, Sijia Liu, Finlay S Mungall, Xue Lin, and Yanzhi Wang. Reweighted proximal pruning for large-scale language representation. arXiv preprint arXiv:1909.12486, 2019.

Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, 1135–1143. 2015.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4365–4374. 2019.

Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems, 14014–14024. 2019.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, 1–9. 2018.

Alessandro Raganato, Yves Scherrer, and Jörg Tiedemann. Fixed encoder self-attention patterns in transformer-based machine translation. arXiv preprint arXiv:2002.10260, 2020.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2383–2392. 2016.

Victor Sanh, Thomas Wolf, and Alexander M Rush. Movement pruning: adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683, 2020.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, 8815–8821. 2020.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5797–5808. 2019.

Ziheng Wang, Jeremy Wohlwend, and Tao Lei. Structured pruning of large language models. arXiv preprint arXiv:1910.04732, 2019.

Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1112–1122. 2018.

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: quantized 8bit BERT. arXiv preprint arXiv:1910.06188, 2019.

Yichu Zhou is the owner of this blog.
