Model Quantization Methods#

  • Model quantization methods aim to reduce a model's memory footprint and improve its inference speed.

Memory Computation#

  • Models can be trained and run for inference at different precision levels: float32, float16 (half precision), bfloat16 (developed by Google), int8, int4, and int2.

  • When we convert weights to a lower precision, there is usually some loss of model performance (a quick demonstration follows this list).

  • Choosing the correct quantization is a tradeoff between accuracy, speed and memory.
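
To see this loss directly, we can cast a value from float32 down to float16 and back. A minimal sketch in PyTorch; the example value is arbitrary:

import torch

x = torch.tensor(0.123456789, dtype=torch.float32)

# Round-trip through half precision: float16 keeps only ~3 significant
# decimal digits, so part of the value is lost.
half = x.to(torch.float16)
print(half.item())                        # the rounded value
print((x - half.float()).abs().item())    # the induced rounding error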

Estimating memory for a model#

If a model has 8B parameters stored in float16, how much memory do its weights consume?

def get_required_memory(params, fp=16):
    """Estimate the memory (in GiB) needed for `params` parameters at `fp` bits each."""
    bytes_per_param = fp / 8  # e.g. 16 bits -> 2 bytes per parameter
    total_bytes = params * bytes_per_param
    # 1 GiB = 1024 MiB = 1024**2 KiB = 1024**3 bytes
    return total_bytes / 1024**3
get_required_memory(8e9,fp=16)
14.901161193847656
get_required_memory(8e9,fp=8)
7.450580596923828
get_required_memory(8e9,fp=4)
3.725290298461914
get_required_memory(70e9,fp=16)
130.385160446167

Quantization Methods#

  • Post-training Quantization: quantization is applied after the model is trained. It is simple to apply but may lead to an accuracy drop (a minimal example follows this list).

  • Quantization-Aware Training: the model is trained with quantization simulated in the forward pass, so the weights adapt to the lower precision.

  • Mixed-Precision Quantization: some weights are kept at higher precision while others are quantized to lower precision.
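
As a concrete example of post-training quantization, here is a minimal sketch using PyTorch's built-in dynamic quantization (assuming a recent PyTorch where it lives under torch.ao.quantization); the toy model is a hypothetical stand-in for a trained network:

import torch
import torch.nn as nn

# A toy stand-in for a trained model (hypothetical layer sizes).
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Post-training (dynamic) quantization: Linear weights become int8,
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)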

K_S#

  • Uniform quantization: the range of the float weights is divided uniformly into buckets to achieve the desired quantization (see the sketch after this list).

  • The simplest method; fast, but it may lose significant accuracy.
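
A minimal sketch of the idea in NumPy; the helper names are made up for illustration:

import numpy as np

def uniform_quantize(w, bits=4):
    # Split the observed weight range into 2**bits evenly spaced levels.
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((w - lo) / scale).astype(np.int32)  # bucket index per weight
    return codes, lo, scale

def uniform_dequantize(codes, lo, scale):
    return lo + codes * scale  # map each bucket index back to a float level

w = np.random.randn(8).astype(np.float32)
codes, lo, scale = uniform_quantize(w, bits=4)
print(np.abs(w - uniform_dequantize(codes, lo, scale)).max())  # rounding error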

K_M#

  • Non-uniform quantization: the quantization levels are learned from the distribution of the model's weights. It is more complex than K_S but offers better performance; a simple k-means clustering lets us round each weight to its nearest cluster centroid (see the sketch below).
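
A sketch of the k-means approach, assuming scikit-learn is available; each weight is replaced by its nearest learned centroid:

import numpy as np
from sklearn.cluster import KMeans

w = np.random.randn(256).astype(np.float32)

# Learn 16 quantization levels (4 bits) from the weight distribution itself.
kmeans = KMeans(n_clusters=16, n_init=10).fit(w.reshape(-1, 1))
codes = kmeans.labels_                    # one 4-bit code per weight
levels = kmeans.cluster_centers_.ravel()  # non-uniformly spaced levels

w_hat = levels[codes]                     # dequantize: nearest centroid
print(np.abs(w - w_hat).mean())           # typically lower error than uniform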

K_L#

  • The KL divergence between the distributions of the original and the quantized model weights is minimized (a rough sketch follows this list).

  • Results in lower information loss.

  • Computationally expensive.
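
A rough sketch of the idea, not a faithful implementation of any particular K_L quantizer: try several clipping thresholds and keep the one whose quantized weights have the lowest KL divergence from the originals (scipy.stats.entropy computes KL divergence when given two distributions):

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def quantize_with_clip(w, t, bits=4):
    # Symmetric uniform quantization that clips values beyond threshold t.
    qmax = 2**(bits - 1) - 1
    scale = t / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

def kl(a, b, bins=128):
    # KL divergence between histograms of the two weight sets.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    return entropy(p + 1e-10, q + 1e-10)  # smooth to avoid zero bins

w = np.random.randn(10_000).astype(np.float32)
thresholds = np.linspace(0.5, np.abs(w).max(), 20)
best = min(thresholds, key=lambda t: kl(w, quantize_with_clip(w, t)))
print(best)  # the clipping threshold that minimizes information loss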

How to quantize a model?#

from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, freeze, qint8
# Load a small causal LM to quantize
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.lm_head
Linear(in_features=768, out_features=50257, bias=False)
# Replace the weights and activations of supported layers with int8 versions
quantize(model, weights=qint8, activations=qint8)
# freeze() replaces the float weights with their quantized integer values
freeze(model)
model
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): QLinear(in_features=768, out_features=50257, bias=False)
)