Weight pruning is another line of work for compressing BERT-style models. In this post, I survey BERT-related weight pruning papers.
In general, weight pruning means identifying and removing redundant or less essential weights and/or components. Related work can be divided into three categories:
- Showing that the weights are indeed redundant and prunable.
- Element-wise pruning.
- Structured pruning.
As with data quantization, one characteristic of this line of work is that the weights usually become very sparse after pruning. Sparse weights can greatly reduce model size, but sparsity alone does not speed up inference: a dense matrix multiply still touches every entry, zero or not. As a result, special hardware (or sparse kernels) is needed to achieve a decent computation speed-up after element-wise pruning.
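To make the size-versus-speed trade-off concrete, here is a minimal NumPy sketch (the matrix shape and the 60% sparsity level are arbitrary choices for illustration): thresholding by magnitude shrinks the storage needed for the weights, but a dense matmul would still perform the same number of multiply-adds.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))   # a dense weight matrix (BERT-base hidden size)

# Zero out the 60% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(W), 0.60)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

sparsity = 1.0 - np.count_nonzero(W_pruned) / W_pruned.size
print(f"sparsity: {sparsity:.2f}")

# Stored as (index, value) pairs, the sparse matrix takes far less space...
dense_bytes = W_pruned.nbytes
sparse_bytes = np.count_nonzero(W_pruned) * (W_pruned.itemsize + 4)  # float64 + int32 index
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# ...but a dense matmul on W_pruned still visits every entry, which is why
# element-wise sparsity gives no wall-clock speed-up without special kernels.
```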
This line of work empirically shows that the weights of large transformer-based models are indeed redundant: a significant fraction can be removed without hurting the final performance.
Kovaleva et al. (2019) conduct \(6\) different experiments on the attention mechanism of the BERT model. The authors conclude that BERT is highly over-parameterized by showing that different heads exhibit repeated self-attention patterns. Moreover, in their disabling experiments, they find that removing both single and multiple heads is not detrimental to model performance and, in some cases, even improves it. Their experiments also show that the last two layers encode task-specific features and contribute to the performance gain.
Similarly, Michel et al. (2019) conduct extensive experiments to analyze the importance of attention heads in transformer-based models. They perform their experiments on WMT (Ott et al., 2018) and BERT (Devlin et al., 2019). In their first setting, they remove one head at a time to see its impact. They find that the majority of attention heads can be removed without deviating much from the original score. They also find that, in some cases, removing an attention head increases performance, which is consistent with the conclusions of Kovaleva et al. (2019). To answer the question of whether more than one head is needed, they remove all attention heads but one within a single layer. Surprisingly, they find that a single head is indeed sufficient at test time. They then prune attention heads iteratively, and their experiments show that removing \(20\%\) to \(40\%\) of the heads does not incur any noticeable negative impact.
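The head-ablation setup can be sketched with a toy multi-head attention in NumPy (the shapes, random weights, and masking mechanism are illustrative assumptions, not the authors' code): multiplying each head's output by a 0/1 mask entry disables it, and the change in the layer output is a crude proxy for that head's importance.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, head_mask):
    """x: (seq, d_model); Wq/Wk/Wv: (heads, d_model, d_head); head_mask: (heads,)."""
    outs = []
    for h in range(Wq.shape[0]):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        outs.append(head_mask[h] * (attn @ v))   # a 0 entry disables head h
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 4, 16, 4, 4
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((heads, d_model, d_head)) for _ in range(3))

full = multi_head_attention(x, Wq, Wk, Wv, np.ones(heads))
for h in range(heads):
    mask = np.ones(heads)
    mask[h] = 0.0                                # ablate one head at a time
    ablated = multi_head_attention(x, Wq, Wk, Wv, mask)
    print(f"head {h}: output change {np.linalg.norm(full - ablated):.3f}")
```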
Voita et al. (2019) report consistent results, but go a step further by showing that some "important heads" fall into three functional roles: (i) positional, (ii) syntactic, and (iii) attending to rare words. They also propose a new pruning method that applies a regularized objective during fine-tuning. They observe that these specialized heads are often the last to be pruned, which confirms their importance.
After the redundancy of the weights has been established empirically, many pruning methods have been proposed. One line of work focuses on identifying and removing individual weights of a given model. The importance of each weight can be determined by its absolute value, its gradients, or other measurements defined by the designer.
Gordon et al. (2020) apply a simple magnitude weight pruning (Han et al., 2015) strategy to BERT. Using this strategy, they prune the BERT model from \(0\%\) to \(90\%\) sparsity in increments of \(10\%\). Their experiments show that a simple strategy like magnitude weight pruning can remove \(30\%-40\%\) of the weights without hurting the pre-training loss or inference on any downstream task. However, as sparsity keeps increasing, the pre-training loss starts to increase and performance starts to degrade.
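A minimal sketch of this sweep, assuming global magnitude pruning with a single threshold shared across all matrices (the exact granularity in the paper may differ):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest |w|, globally."""
    flat = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(flat, sparsity)
    return [np.where(np.abs(w) >= threshold, w, 0.0) for w in weights]

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) for _ in range(4)]  # toy stand-in for BERT

# Sweep sparsity from 0% to 90% in increments of 10%, as in the paper.
for s in np.arange(0.0, 1.0, 0.1):
    pruned = magnitude_prune(layers, s)
    kept = sum(np.count_nonzero(w) for w in pruned)
    total = sum(w.size for w in pruned)
    print(f"sparsity {s:.0%}: {kept / total:.0%} of weights kept")
```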
Similarly, Sanh et al. (2020) use a modified magnitude weight pruning strategy, called movement pruning. The basic idea is to learn a scoring matrix during training that indicates the importance of each weight. Intuitively, magnitude pruning selects the weights that are far from zero, whereas movement pruning selects the weights that are moving away from zero during training. Their experiments show that with this method, BERT can achieve \(95\%\) of the original BERT performance with only \(5\%\) of the encoder's weights on natural language inference (MNLI) (Williams et al., 2018) and question answering (SQuAD v1.1) (Rajpurkar et al., 2016).
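A rough sketch of the masking step: in movement pruning the scores accumulate \(-\frac{\partial L}{\partial W} \cdot W\) over training steps, so here a single fabricated gradient step stands in for the learned scores (everything below is a toy illustration, not the paper's training loop).

```python
import numpy as np

def movement_prune_mask(W, S, keep_frac):
    """Keep the top-`keep_frac` weights ranked by learned scores S, not by |W|."""
    k = max(1, int(W.size * keep_frac))
    threshold = np.sort(S.ravel())[-k]        # k-th largest score
    return (S >= threshold).astype(W.dtype)

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))

# Weights moving away from zero have -(dL/dW) * W > 0 and thus score high;
# `grad` is a random stand-in for an actual loss gradient.
grad = rng.standard_normal(W.shape)
S = -grad * W                                 # one fake score update

mask = movement_prune_mask(W, S, keep_frac=0.05)
print(f"{mask.mean():.2%} of weights kept")   # ~5%
```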
Unlike Gordon et al. (2020) and Sanh et al. (2020), which prune weights based on the value of each weight, Guo et al. (2019) use a regularizer to constrain the weights. They propose a pruning method called Reweighted Proximal Pruning (RPP), in which the \(L_1\) regularizer is iteratively reweighted. In their experiments, they show that RPP can achieve \(59.3\%\) weight sparsity without inducing a performance loss on either pre-training or fine-tuning tasks.
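The iterative reweighting idea can be illustrated with plain reweighted-\(L_1\) proximal steps on a random matrix (a toy sketch under simplified assumptions; RPP's actual objective and optimizer are more involved): weights already near zero receive a larger effective penalty, so they are pushed to exactly zero faster.

```python
import numpy as np

def soft_threshold(W, t):
    """Proximal operator of t * ||W||_1, applied element-wise."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
lam, eps = 0.05, 1e-3

for step in range(5):
    # Reweighting: alpha_i = 1 / (|w_i| + eps), so small weights are
    # penalized harder on the next proximal step.
    alpha = 1.0 / (np.abs(W) + eps)
    W = soft_threshold(W, lam * alpha)
    sparsity = 1.0 - np.count_nonzero(W) / W.size
    print(f"step {step}: sparsity {sparsity:.2%}")
```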
Different from element-wise pruning, structured pruning compresses models by identifying and removing component modules, such as attention heads or embedding layers.
Self-attention lets the model find the most relevant parts of the input. However, Raganato et al. (2020) show that replacing these learnable attention heads with simple, fixed, non-learnable attentive patterns does not impact translation quality. These attentive patterns are based solely on position and do not require any external knowledge.
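One such position-only pattern, e.g. "attend to the previous token", can be written as a fixed, parameter-free attention matrix (this particular pattern is just one illustrative example; the exact set of patterns used in the paper may differ):

```python
import numpy as np

def previous_token_pattern(seq_len):
    """Fixed, non-learnable attention: position i attends to token i-1
    (position 0 attends to itself). Depends only on position."""
    A = np.zeros((seq_len, seq_len))
    A[0, 0] = 1.0
    for i in range(1, seq_len):
        A[i, i - 1] = 1.0
    return A

A = previous_token_pattern(5)
x = np.arange(5.0).reshape(5, 1)   # toy "values": 0, 1, 2, 3, 4
out = A @ x
print(out.ravel())                  # each row picks up the previous token's value
```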
Fan et al. (2019) propose a different structured pruning method that adopts the idea of dropout. When training neural networks, we usually apply dropout for better performance. Fan et al. (2019) follow the same idea, but drop entire layers instead of individual weights, a method they call LayerDrop. At training time, they sample a set of layers to drop in each iteration; at test time, one can choose how many layers to use depending on the task and the available computation resources.
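A minimal sketch of LayerDrop with toy residual layers (the layer definitions, drop rate, and test-time selection rule below are illustrative assumptions): because each layer is residual, skipping it leaves shapes and the forward pass valid.

```python
import numpy as np

def forward_with_layerdrop(x, layers, p_drop, rng, training=True):
    """Skip each residual layer independently with probability p_drop at training time."""
    for layer in layers:
        if training and rng.random() < p_drop:
            continue                   # drop the whole layer for this step
        x = x + layer(x)               # residual connection keeps shapes valid
    return x

rng = np.random.default_rng(0)
# Toy "transformer layers": random linear maps with a tanh nonlinearity.
layers = [lambda x, W=rng.standard_normal((8, 8)) * 0.1: np.tanh(x @ W)
          for _ in range(12)]
x = rng.standard_normal((4, 8))

train_out = forward_with_layerdrop(x, layers, p_drop=0.5, rng=rng)
# At test time, keep e.g. every other layer for a model half as deep.
test_out = forward_with_layerdrop(x, layers[::2], p_drop=0.0, rng=rng, training=False)
print(train_out.shape, test_out.shape)
```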
Instead of focusing on attention heads or entire transformer layers, Wang et al. (2019) prune the model by factorizing its weight matrices. The idea is to factorize each matrix into two smaller matrices and insert a diagonal mask matrix between them. The diagonal mask is pruned via regularization, and an augmented Lagrangian approach controls the final sparsity. One advantage of this method is that it is generic: it can be applied to any matrix multiplication.
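A sketch of the factorized parameterization with a diagonal mask, using random matrices (names and shapes are illustrative; the regularization and augmented Lagrangian training loop are omitted): once an entry of the mask reaches zero, the matching column of one factor and row of the other can be deleted outright, shrinking the matmul.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 32
A = rng.standard_normal((d, r))    # W is parameterized as A @ diag(g) @ B
B = rng.standard_normal((r, d))
g = np.ones(r)                     # diagonal mask, driven to zero by regularization

x = rng.standard_normal((4, d))

# Suppose training zeroed out half of the mask entries:
g[r // 2:] = 0.0
keep = g > 0

masked = ((x @ A) * g) @ B                 # full factorization with the mask applied
low_rank = (x @ A[:, keep]) @ B[keep]      # equivalent, with the dead rank dropped
print(low_rank.shape)                      # (4, 64), with half the multiply-adds
```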
A list of weight pruning papers:
- Revealing the dark secrets of BERT (Kovaleva et al., 2019) pdf
- Are sixteen heads really better than one? (Michel et al., 2019) pdf
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned (Voita et al., 2019) pdf
- Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning (Gordon et al., 2020) pdf
- Reweighted Proximal Pruning for Large-scale Language Representation (Guo et al., 2019) pdf
- Movement Pruning: Adaptive Sparsity by Fine-Tuning (Sanh et al., 2020) pdf
- Fixed encoder self-attention patterns in transformer-based machine translation (Raganato et al., 2020) pdf
- Reducing transformer depth on demand with structured dropout (Fan et al., 2019) pdf
- Structured Pruning of Large Language Models (Wang et al., 2019) pdf
A comparison between the weight pruning papers:
| Paper | Pruning ratio | Speed-up | Pruned component | Pruned before fine-tuning | Performance |
|---|---|---|---|---|---|
| (Zafrir et al., 2019) | N/A | N/A | N/A | N/A | N/A |
| (Fan et al., 2020) | N/A | N/A | N/A | N/A | N/A |
| (Shen et al., 2020) | N/A | N/A | N/A | N/A | N/A |
| (Gordon et al., 2020) | 30~40% | N/A | all weights | True | same |
| (Guo et al., 2019) | 12%~41% | N/A | all weights | True | worse |
| (Sanh et al., 2020) | 3%~10% | N/A | layers and heads | False | worse |
| (Raganato et al., 2020) | N/A | N/A | attention heads | False | similar |
| (Fan et al., 2019) | 25%~50% | N/A | transformer layers | True | same |
| (Wang et al., 2019) | 65% | N/A | matrix multiplication | False | slightly worse |
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. 2019.
Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. arXiv preprint arXiv:2004.07320, 2020.
Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 1, 1135–1143. 2015.
Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 4365–4374. 2019.
Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, 1–9. 2018.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2383–2392. 2016.
Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, 8815–8821. 2020.
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5797–5808. 2019.
Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1112–1122. 2018.
Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: Quantized 8-bit BERT. arXiv preprint arXiv:1910.06188, 2019.
So what do you think? Did I miss something? Is any part unclear? Leave your comments below.