A survey of BERT compression using data quantization.

Published

September 3, 2020

Updated

January 2, 2023

Introduction

Since 2018, BERT
and its variants (RoBERTa and
XLNet) have shown
significant improvements on many Natural Language Processing
tasks. However, all of these models contain a large number of
parameters, which results in a prohibitive memory footprint
and slow inference. To overcome these problems, many
methods have been proposed to compress these large models.
In this post, I survey papers that compress these
models using data quantization.

Data quantization refers to representing each model
weight using fewer bits, which reduces the memory footprint
of the model at the cost of lower precision in its numerical
calculations. Data quantization has a few notable
characteristics:

It can reduce memory requirements.

With proper hardware, it can also improve inference speed.

It can be applied to any component of a model.

It does not change the model architecture.
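To make the memory claim concrete, here is a small back-of-the-envelope sketch. It assumes a model with roughly 110 million parameters (the commonly quoted size of BERT-base) and counts only weight storage, ignoring activations and any per-tensor scale factors. The function name `model_size_mb` is mine, not from any of the papers.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Storage needed for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumption: about 110 million parameters, roughly the size of BERT-base.
NUM_PARAMS = 110_000_000

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {model_size_mb(NUM_PARAMS, bits):7.1f} MB")
```

Under these assumptions, going from 32-bit floats to 8-bit integers shrinks the weights by a factor of 32 / 8 = 4.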

A straightforward idea is to truncate the weights directly
to the target bit width. However, this operation
usually leads to a large drop in
accuracy (Ganesh et al.). To
overcome this problem, the Quantization-Aware
Training (QAT) approach was proposed.
Unlike simply truncating the weights, QAT quantizes the
weights and then retrains the model so that it adapts to the
quantized weights.
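The effect of quantizing without any retraining can be illustrated with a minimal sketch. It uses symmetric uniform quantization with round-to-nearest, which is one common scheme and not necessarily the one evaluated in the cited work. The weights are random stand-ins, not real BERT weights, and all function names are hypothetical.

```python
import random


def quantize(weights, num_bits):
    """Symmetric uniform quantization: floats -> signed integers plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to (approximate) float weights."""
    return [v * scale for v in q]


def mean_abs_error(weights, num_bits):
    """Average absolute difference between original and restored weights."""
    q, scale = quantize(weights, num_bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
# Assumption: Gaussian toy weights standing in for one weight matrix.
weights = [random.gauss(0.0, 0.05) for _ in range(1000)]
for bits in (8, 4, 2):
    print(f"{bits}-bit: mean abs error = {mean_abs_error(weights, bits):.5f}")
```

The rounding error grows quickly as the bit width shrinks. This is the error that QAT tries to let the model compensate for during retraining.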

Data Quantization Papers

Not many papers directly quantize models in
the BERT family.

Zafrir et al.
quantize all general matrix
multiply (GEMM) operations in BERT's fully connected and
embedding layers. All of these weights are quantized to
8-bit integers. During fine-tuning, they deliberately
introduce quantization error in the forward pass so that the
model learns to compensate for it. Because rounding is not
differentiable, they use the Straight-Through
Estimator (STE) to estimate the
gradient. According to their experiments, they
can compress the model by a factor of \(4\) with minimal
accuracy loss.
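The combination of simulated quantization in the forward pass and a straight-through gradient can be sketched on a toy linear model. This is only an illustration of the general idea under my own assumptions (symmetric 8-bit grid, plain SGD, a four-weight regression problem), not the authors' implementation. Names such as `fake_quantize` and `train_step` are hypothetical.

```python
import random


def fake_quantize(weights, num_bits=8):
    """Quantize then immediately dequantize: floats snapped to an integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def train_step(weights, x, y, lr, num_bits):
    w_q = fake_quantize(weights, num_bits)      # forward pass sees quantization error
    error = predict(w_q, x) - y
    # Gradient of 0.5 * error**2 with respect to the quantized weights.
    grad_wq = [error * xi for xi in x]
    # Straight-through estimator: treat d(w_q)/d(w) as 1 and apply the
    # gradient to the full-precision weights unchanged.
    return [w - lr * g for w, g in zip(weights, grad_wq)]


random.seed(0)
true_w = [0.5, -0.3, 0.8, 0.1]                  # hypothetical target weights
weights = [0.0, 0.0, 0.0, 0.0]                  # full-precision "shadow" weights
for _ in range(2000):
    x = [random.uniform(-1, 1) for _ in true_w]
    weights = train_step(weights, x, predict(true_w, x), lr=0.1, num_bits=8)

print([round(w, 3) for w in fake_quantize(weights, 8)])
```

The full-precision weights are kept only for training. At inference time, only the quantized values would be stored.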

Similarly, Fan et al. also try to let the model
adapt to quantization during training. In this
paper, the authors make a simple modification to the QAT
approach: in each forward pass, they quantize only a
randomly selected subset of the weights instead of the full
network as in QAT. The remaining weights are left unchanged.
They call this approach Quant-Noise. I think Quant-Noise is
very similar in spirit to dropout.
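A rough sketch of the random-subset idea follows. It selects individual weights independently with probability `rate`, which is a simplification I chose for brevity; the granularity and quantizer used in the paper may differ. The names `quant_noise` and `rate` are hypothetical.

```python
import random


def fake_quantize(weights, num_bits=8):
    """Quantize then immediately dequantize: floats snapped to an integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def quant_noise(weights, rate, num_bits=8, rng=random):
    """Quantize a random subset of weights; leave the rest in full precision."""
    quantized = fake_quantize(weights, num_bits)
    # Each weight is replaced by its quantized value with probability `rate`.
    return [q if rng.random() < rate else w for w, q in zip(weights, quantized)]


random.seed(0)
# Assumption: Gaussian toy weights, 4-bit grid so the effect is easy to see.
weights = [random.gauss(0.0, 0.05) for _ in range(8)]
noisy = quant_noise(weights, rate=0.5, num_bits=4)
for w, n in zip(weights, noisy):
    tag = "changed" if n != w else "unchanged"
    print(f"{w:+.4f} -> {n:+.4f} ({tag})")
```

With `rate=1.0` this reduces to quantizing every weight in the forward pass, as in plain QAT, and with `rate=0.0` it is ordinary full-precision training. The dropout analogy comes from this random masking on each forward pass.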

Instead of quantizing all the weights to the same
bit width, Shen et al. try to quantize different parts of
the model with different bit widths. The motivation is that
different components (or layers) of a model have different
sensitivity to quantization, so we
should assign more bits to the more sensitive components. In
this paper, they propose a new Hessian-based metric
to evaluate the sensitivity of different layers.
Interestingly, in their experiments, they find that the
embedding layer is more sensitive to quantization than other
weights, which means the embedding layer needs more
bits to maintain accuracy.
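The idea of giving more bits to more sensitive layers can be sketched as follows. As a stand-in for the paper's metric, this example uses the largest eigenvalue of a per-layer Hessian, estimated by power iteration; that choice, the tiny 2x2 matrices, the layer names, and the 8-bit versus 4-bit split are all assumptions made for illustration.

```python
import math


def top_eigenvalue(matrix, iterations=100):
    """Power iteration: estimate the dominant eigenvalue of a symmetric matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n

    def matvec(vec):
        return [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]

    for _ in range(iterations):
        mv = matvec(v)
        norm = math.sqrt(sum(x * x for x in mv))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in mv]
    # Rayleigh quotient v^T M v for the converged unit vector.
    return sum(a * b for a, b in zip(v, matvec(v)))


def assign_bits(sensitivity, num_high, high_bits=8, low_bits=4):
    """Give `high_bits` to the `num_high` most sensitive layers, `low_bits` to the rest."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    return {name: (high_bits if i < num_high else low_bits)
            for i, name in enumerate(ranked)}


# Hypothetical toy Hessians; real ones are far too large to store explicitly.
hessians = {
    "embedding": [[9.0, 1.0], [1.0, 4.0]],
    "encoder_layer_1": [[2.0, 0.5], [0.5, 1.0]],
    "encoder_layer_2": [[0.5, 0.0], [0.0, 0.2]],
}
sensitivity = {name: top_eigenvalue(h) for name, h in hessians.items()}
bits = assign_bits(sensitivity, num_high=1)
for name in hessians:
    print(f"{name}: sensitivity = {sensitivity[name]:.3f}, bits = {bits[name]}")
```

In this toy setup the embedding layer has the sharpest curvature and therefore keeps the higher bit width, which mirrors the finding reported above.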