Since 2018, BERT (Devlin et al., 2019) and its variants (Liu et al., 2019; Yang et al., 2019) have shown significant improvements on many Natural Language Processing tasks. However, all of these models contain a large number of parameters, which results in a prohibitive memory footprint and slow inference. To overcome these problems, many methods have been proposed to compress these large models. In this post, I survey papers that compress them using data quantization.

Data quantization refers to representing each model weight using fewer bits, which reduces the memory footprint of the model and lowers the precision of its numerical calculations. There are a few characteristics of data quantization:

  • It can reduce memory requirements.
  • With proper hardware, it can also improve inference speed.
  • It can be applied to any component of a model.
  • It does not change the model architecture.
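
To make the memory saving concrete, here is a rough back-of-the-envelope sketch in Python. The parameter count of 110 million is my assumption (roughly the size usually quoted for BERT-base), and the calculation ignores extra metadata such as quantization scales, so the numbers are only illustrative.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage in megabytes (10**6 bytes) for the weights alone."""
    return num_params * bits_per_weight / 8 / 1e6

# Assumption: ~110M parameters, roughly the size of BERT-base.
NUM_PARAMS = 110_000_000

fp32_mb = model_size_mb(NUM_PARAMS, 32)   # 32-bit floating point weights
int8_mb = model_size_mb(NUM_PARAMS, 8)    # 8-bit integer weights

print(f"32-bit floats:  {fp32_mb:.0f} MB")
print(f"8-bit integers: {int8_mb:.0f} MB")
print(f"compression ratio: {fp32_mb / int8_mb:.0f}x")
```

Going from 32 bits to 8 bits per weight gives a 4x reduction regardless of the parameter count, because only the bits-per-weight factor changes.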

A straightforward idea is to truncate the weight bits directly to the target bit width. However, this operation usually leads to a large drop in accuracy (Ganesh et al., 2020). To overcome this problem, the Quantization Aware Training (QAT) approach (Jacob et al., 2018) was proposed. Unlike simple truncation, QAT quantizes the weights and then retrains the model so that it adapts to the quantized weights.
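
The naive approach can be sketched in a few lines of Python. This is only an illustration under my own assumptions: it uses symmetric uniform quantization with round-to-nearest (not literal bit truncation), a single scale for the whole weight list, and made-up weight values.

```python
def quantize(weights, num_bits=8):
    """Symmetric uniform quantization: map floats to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1               # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]

# Made-up weights, purely for illustration.
weights = [0.82, -0.41, 0.05, -0.003, 0.27]

for bits in (8, 4, 2):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, max error={max_err:.4f}")
```

The reconstruction error is bounded by half the quantization step, and the step grows quickly as the bit width shrinks. This is the error that QAT lets the model adapt to during training.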

Data Quantization Papers

Not many papers directly quantize models of the BERT family.

Zafrir et al. (2019) quantize all general matrix multiply (GEMM) operations in BERT's fully connected and embedding layers. All of these weights are quantized to 8-bit integers. During fine-tuning, they deliberately introduce quantization error so that the model learns to compensate for it. Because the rounding step is not differentiable, they use the Straight-Through Estimator (STE) (Jacob et al., 2018) to estimate the gradient. According to their experiments, they can compress the model by \(4\) times with minimal accuracy loss.
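
The following toy sketch shows the general idea of fake quantization with a straight-through estimator. It is not the implementation from Zafrir et al. (2019): the single scalar weight, the fixed step size `scale`, the learning rate and the squared-error loss are all assumptions I chose to keep the example small.

```python
def fake_quantize(w, scale, num_bits=8):
    """Round w to the nearest point on the integer grid, then map back to float."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale

# Toy problem: fit y = w * x to a single target using a quantized weight.
x, target = 2.0, 1.3      # the ideal full-precision weight would be 0.65
scale = 0.1               # hypothetical fixed quantization step
w = 0.0                   # latent full-precision weight kept during training
lr = 0.05                 # hypothetical learning rate

for step in range(200):
    w_q = fake_quantize(w, scale)     # forward pass sees the quantized weight
    error = w_q * x - target
    grad_wq = 2 * error * x           # d(loss)/d(w_q) for loss = error ** 2
    grad_w = grad_wq                  # STE: treat d(w_q)/d(w) as 1
    w -= lr * grad_w                  # update the latent full-precision weight

print(f"latent weight: {w:.3f}, quantized weight: {fake_quantize(w, scale):.1f}")
```

Without the STE, the gradient of the rounding step would be zero almost everywhere and the latent weight would never move. In this toy setting the ideal value 0.65 is not on the grid, so the latent weight tends to hover near the boundary between the two closest grid points.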

Similarly, Fan et al. (2020) also let the model adjust to quantization during training. The authors make a simple modification to the QAT approach: in each forward pass, they quantize only a randomly selected subset of the weights instead of the full network as in QAT. The remaining weights are left unchanged. They call this approach Quant-Noise. I think Quant-Noise is very similar in spirit to dropout.
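
Here is a minimal sketch of the random-subset idea. The function names, the selection probability `p`, the step size `scale` and the per-weight selection are my assumptions for illustration. The original paper may select weights at a different granularity and uses its own quantizers.

```python
import random

def quantize_value(w, scale):
    """Round a single weight to the nearest multiple of scale."""
    return round(w / scale) * scale

def quant_noise_forward(weights, p, scale, rng):
    """Quantize each weight with probability p; leave the rest untouched."""
    return [quantize_value(w, scale) if rng.random() < p else w for w in weights]

rng = random.Random(0)                       # seeded for reproducibility
weights = [0.82, -0.41, 0.05, -0.13, 0.27, 0.64]   # made-up values

# A different random subset would be drawn on every forward pass.
noisy = quant_noise_forward(weights, p=0.5, scale=0.25, rng=rng)
for original, used in zip(weights, noisy):
    tag = "quantized" if used != original else "kept"
    print(f"{original:+.2f} -> {used:+.2f} ({tag})")
```

Setting `p=1.0` quantizes every weight in the forward pass, which corresponds to ordinary QAT, while `p=0.0` is plain full-precision training.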

Instead of quantizing all the weights to the same bit width, Shen et al. (2020) assign different bit widths to different parts of the model. The motivation is that different components (or layers) of a model have different sensitivity to quantization, so more bits should be assigned to the more sensitive components. They propose a new Hessian-based metric to evaluate the sensitivity of each layer. Interestingly, their experiments show that the embedding layer is more sensitive to quantization than the other weights, which means that the embedding layer needs more bits to maintain accuracy.
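
As a rough illustration of mixed-precision assignment, the sketch below ranks layers by a sensitivity score and gives more bits to the more sensitive ones. The scores are invented, and the rank-based rule is a simple heuristic of my own. Neither the Hessian-based metric nor the actual assignment procedure of Shen et al. (2020) is reproduced here.

```python
def assign_bits(sensitivity, bit_choices=(2, 4, 8)):
    """Give more bits to layers with higher sensitivity scores.

    Layers are ranked by score and split into equally sized groups,
    one group per available bit width.
    """
    ranked = sorted(sensitivity, key=sensitivity.get)   # least sensitive first
    choices = sorted(bit_choices)
    bits = {}
    for rank, name in enumerate(ranked):
        group = rank * len(choices) // len(ranked)
        bits[name] = choices[group]
    return bits

# Hypothetical sensitivity scores, invented for this example.
sensitivity = {
    "embedding": 9.1,
    "layer_0": 4.2,
    "layer_3": 2.5,
    "layer_6": 1.3,
    "layer_9": 0.7,
    "layer_11": 0.4,
}

bits = assign_bits(sensitivity)
for name in sensitivity:
    print(f"{name:10s} sensitivity={sensitivity[name]:4.1f} -> {bits[name]} bits")
```

With these made-up scores the embedding layer ends up in the highest-precision group, which mirrors the observation above that it needs more bits than the other weights.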


The above three papers are the only ones I could find that apply data quantization to BERT-style models. I think one possible reason is that, on common hardware, data quantization only reduces the model size and not the inference time. Although data quantization can also improve inference speed with proper hardware, I doubt that many people have access to such hardware.

The comparison between data quantization papers:

Paper                | Memory (BERT = 100%) | Inference | Component  | Pretrain | Performance
Zafrir et al. (2019) | 25%                  | N/A       | Float Bits | False    | very close
Fan et al. (2020)    | 5%                   | N/A       | Float Bits | False    | worse
Shen et al. (2020)   | 7.8%                 | N/A       | Float Bits | False    | slightly worse


Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. 2019.

Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. arXiv preprint arXiv:2004.07320, 2020.

Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. Compressing large-scale transformer-based models: a case study on BERT. arXiv preprint arXiv:2002.11985, 2020.

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2704–2713. IEEE, 2018.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: a robustly optimized BERT pretraining approach. arXiv, pages arXiv–1907, 2019.

Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, 8815–8821. 2020.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, 5753–5763. 2019.

Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: quantized 8bit BERT. arXiv preprint arXiv:1910.06188, 2019.

Yichu Zhou is the owner of this blog.
