Model Quantization Methods#

  • Model quantization methods aim to reduce a model's memory footprint and improve its inference speed.

Memory Computation#

  • Models can be trained and run for inference at different precision levels: float32, float16 (half precision), bfloat16 (developed by Google), int8, int4, and int2.

  • When we convert weights to a lower precision, there is usually some loss of model performance (a quick demonstration follows this list).

  • Choosing the correct quantization is a tradeoff between accuracy, speed and memory.
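
To see this loss directly, we can cast a value from float32 down to float16 and back. A minimal sketch in PyTorch; the example value is arbitrary:

import torch

x = torch.tensor(0.123456789, dtype=torch.float32)

# Round-trip through half precision: float16 keeps only ~3 significant
# decimal digits, so part of the value is lost.
half = x.to(torch.float16)
print(half.item())                        # the rounded value
print((x - half.float()).abs().item())    # the induced rounding error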

Estimating memory for a model#

If a model has 8B parameters stored in float16, how much memory do its weights consume?

def get_required_memory(params, fp=16):
    """Estimate the memory (in GiB) needed for `params` parameters at `fp` bits each."""
    bytes_per_param = fp / 8  # e.g. 16 bits -> 2 bytes per parameter
    total_bytes = params * bytes_per_param
    # 1 GiB = 1024 MiB = 1024**2 KiB = 1024**3 bytes
    return total_bytes / 1024**3
get_required_memory(8e9,fp=16)
14.901161193847656
get_required_memory(8e9,fp=8)
7.450580596923828
get_required_memory(8e9,fp=4)
3.725290298461914
get_required_memory(70e9,fp=16)
130.385160446167

Quantization Methods#

  • Post-training Quantization: quantization is applied after the model is trained. It is simple to apply but may lead to an accuracy drop (a minimal example follows this list).

  • Quantization-Aware Training: the model is trained with quantization simulated in the forward pass, so the weights adapt to the lower precision.

  • Mixed-Precision Quantization: some weights are kept at higher precision while others are quantized to lower precision.
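
As a concrete example of post-training quantization, here is a minimal sketch using PyTorch's built-in dynamic quantization (assuming a recent PyTorch where it lives under torch.ao.quantization); the toy model is a hypothetical stand-in for a trained network:

import torch
import torch.nn as nn

# A toy stand-in for a trained model (hypothetical layer sizes).
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Post-training (dynamic) quantization: Linear weights become int8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)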

K_S#

  • Uniform quantization: the range of the float weights is divided uniformly into buckets to achieve the desired quantization (see the sketch after this list).

  • The simplest method; fast, but it may lose significant accuracy.
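
A minimal sketch of the idea in NumPy; the helper names are made up for illustration:

import numpy as np

def uniform_quantize(w, bits=4):
    # Split the observed weight range into 2**bits evenly spaced levels.
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((w - lo) / scale).astype(np.int32)  # bucket index per weight
    return codes, lo, scale

def uniform_dequantize(codes, lo, scale):
    return lo + codes * scale  # map each bucket index back to a float level

w = np.random.randn(8).astype(np.float32)
codes, lo, scale = uniform_quantize(w, bits=4)
print(np.abs(w - uniform_dequantize(codes, lo, scale)).max())  # rounding error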

K_M#

  • Non-uniform quantization: the quantization levels are learned from the distribution of the model's weights. It is more complex than K_S but offers better performance; a simple k-means clustering lets us round each weight to its nearest cluster centroid (see the sketch below).
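
A sketch of the k-means approach, assuming scikit-learn is available; each weight is replaced by its nearest learned centroid:

import numpy as np
from sklearn.cluster import KMeans

w = np.random.randn(256).astype(np.float32)

# Learn 16 quantization levels (4 bits) from the weight distribution itself.
kmeans = KMeans(n_clusters=16, n_init=10).fit(w.reshape(-1, 1))
codes = kmeans.labels_                    # one 4-bit code per weight
levels = kmeans.cluster_centers_.ravel()  # non-uniformly spaced levels

w_hat = levels[codes]                     # dequantize: nearest centroid
print(np.abs(w - w_hat).mean())           # typically lower error than uniform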

K_L#

  • The KL divergence between the distributions of the original and the quantized model weights is minimized (a rough sketch follows this list).

  • Results in lower information loss.

  • Computationally expensive.
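
A rough sketch of the idea, not a faithful implementation of any particular K_L quantizer: try several clipping thresholds and keep the one whose quantized weights have the lowest KL divergence from the originals (scipy.stats.entropy computes KL divergence when given two distributions):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def quantize_with_clip(w, t, bits=4):
    # Symmetric uniform quantization that clips values beyond threshold t.
    qmax = 2**(bits - 1) - 1
    scale = t / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def kl(a, b, bins=128):
    # KL divergence between histograms of the two weight sets.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    return entropy(p + 1e-10, q + 1e-10)  # smooth to avoid zero bins

w = np.random.randn(10_000).astype(np.float32)
thresholds = np.linspace(0.5, np.abs(w).max(), 20)
best = min(thresholds, key=lambda t: kl(w, quantize_with_clip(w, t)))
print(best)  # the clipping threshold that minimizes information loss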

How to quantize a model?#

from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint8
# Load a small causal LM to quantize
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.lm_head
Linear(in_features=768, out_features=50257, bias=False)
# Replace the weights and activations of supported layers with int8 versions
quantize(model, weights=qint8, activations=qint8)
# freeze() replaces the float weights with their quantized integer values
freeze(model)
model
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): QLinear(in_features=768, out_features=50257, bias=False)
)