Model Quantization Methods#
Model quantization methods aim to reduce model size and improve inference speed.
Memory Computation#
Models can be trained and run for inference at different precision levels: float32, float16 (half precision), bfloat16 (developed by Google), int8, int4, and int2.
Converting weights to a lower precision usually costs some model performance.
Choosing the right quantization is therefore a tradeoff between accuracy, speed, and memory.
Estimating memory for a model#
If a model has 8B parameters in float16, how much memory does it consume?
def get_required_memory(params, fp=16):
    """Estimate the memory (in GB) needed to store the model weights."""
    bytes_per_param = fp / 8                  # e.g. 16 bits -> 2 bytes per parameter
    params_bytes = params * bytes_per_param
    # 1 kilobyte = 1024 bytes, 1 megabyte = 1024 kilobytes, 1 gigabyte = 1024 megabytes
    return params_bytes / 1024 / 1024 / 1024
get_required_memory(8e9,fp=16)
14.901161193847656
get_required_memory(8e9,fp=8)
7.450580596923828
get_required_memory(8e9,fp=4)
3.725290298461914
get_required_memory(70e9,fp=16)
130.385160446167
Quantization Methods#
Post-training Quantization: Quantization is applied after training is complete. It may lead to an accuracy drop (a minimal sketch follows this list).
Quantization Aware Training: Quantization is simulated during training so that the model learns to compensate for the reduced precision.
Mixed-Precision Quantization: Some weights are kept at higher precision while the rest are quantized to lower precision.
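As a concrete example of post-training quantization, here is a minimal sketch using PyTorch's dynamic quantization. The toy two-layer model is a placeholder I made up for illustration; only the quantize_dynamic call is the actual PyTorch API.

import torch
import torch.nn as nn

# A toy float32 model standing in for a real network (placeholder for illustration).
model_fp32 = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Post-training dynamic quantization: the weights of the listed module types
# are converted to int8 after training, with no retraining involved.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32,
    {nn.Linear},        # module types to quantize
    dtype=torch.qint8,
)

print(model_int8)       # Linear layers are replaced by DynamicQuantizedLinear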
K_S#
Uniform quantization: the float range is divided uniformly into equal-width buckets to achieve the desired bit width.
The simplest method; fast, but it may lose significant accuracy.
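A minimal NumPy sketch of uniform quantization, assuming 4-bit buckets; the function names are illustrative and not part of any library.

import numpy as np

def uniform_quantize(weights, bits=4):
    """Map float weights uniformly onto 2**bits equal-width buckets."""
    qmin, qmax = 0, 2**bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / (qmax - qmin)        # width of one bucket
    codes = np.round((weights - w_min) / scale)    # integer bucket ids
    return codes.astype(np.uint8), scale, w_min

def uniform_dequantize(codes, scale, w_min):
    """Recover approximate float weights from the bucket ids."""
    return codes * scale + w_min

w = np.random.randn(8).astype(np.float32)
codes, scale, w_min = uniform_quantize(w, bits=4)
w_hat = uniform_dequantize(codes, scale, w_min)
print(np.abs(w - w_hat).max())                     # error is at most scale / 2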
K_M#
Non-uniform quantization: the placement of the quantization levels is learned from the distribution of the model weights. It is more complex than K_S but offers better accuracy. A simple k-means clustering lets us round each weight to its nearest cluster centroid.
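The sketch below illustrates the k-means idea with scikit-learn: a codebook of centroids is learned from the weight distribution and each weight stores only the id of its nearest centroid. This illustrates the general technique, not any specific library's K_M implementation.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, bits=4):
    """Learn 2**bits centroids from the weights and store only cluster ids."""
    km = KMeans(n_clusters=2**bits, n_init=10, random_state=0)
    labels = km.fit_predict(weights.reshape(-1, 1))   # id of the nearest centroid
    codebook = km.cluster_centers_.ravel()            # learned, non-uniform levels
    return labels.astype(np.uint8), codebook

def kmeans_dequantize(labels, codebook):
    return codebook[labels]

w = np.random.randn(1024).astype(np.float32)
labels, codebook = kmeans_quantize(w, bits=4)
w_hat = kmeans_dequantize(labels, codebook)
print(np.abs(w - w_hat).mean())   # typically lower error than uniform buckets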
K_L#
The KL divergence between the original and quantized weight distributions is minimized.
Results in lower information loss.
Computationally more expensive.
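A rough sketch of the KL approach, loosely modeled on calibration schemes such as TensorRT's: several clipping thresholds are tried, and the one whose quantized weight distribution has the smallest KL divergence from the original is kept. This is a simplified illustration, not the exact K_L algorithm.

import numpy as np
from scipy.stats import entropy   # entropy(p, q) computes KL(p || q)

def best_clip_by_kl(weights, bits=8, n_candidates=20):
    """Pick the clipping threshold whose quantized distribution stays closest (in KL) to the original."""
    levels = 2**bits
    max_abs = np.abs(weights).max()
    ref_hist, edges = np.histogram(weights, bins=256, density=True)
    ref_hist = ref_hist + 1e-10                       # avoid zero bins in the KL computation
    best_t, best_kl = None, np.inf
    for t in np.linspace(0.3 * max_abs, max_abs, n_candidates):
        scale = 2 * t / (levels - 1)
        # clip to [-t, t], quantize to integer levels, then dequantize back to floats
        q = np.round(np.clip(weights, -t, t) / scale) * scale
        q = np.clip(q, edges[0], edges[-1])
        q_hist, _ = np.histogram(q, bins=edges, density=True)
        q_hist = q_hist + 1e-10
        kl = entropy(ref_hist, q_hist)                # KL(original || quantized)
        if kl < best_kl:
            best_t, best_kl = t, kl
    return best_t, best_kl

w = np.random.randn(100_000).astype(np.float32)
t, kl = best_clip_by_kl(w)
print(f"best clipping threshold: {t:.3f}, KL divergence: {kl:.4f}")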
How to quantize a model?#
from transformers import AutoModelForCausalLM
from optimum.quanto import quantize, qint8
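(These imports assume the transformers and optimum-quanto packages are installed.)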
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.lm_head
Linear(in_features=768, out_features=50257, bias=False)
quantize(model, weights=qint8, activations=qint8)
from optimum.quanto import freeze
freeze(model)
model
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): QLayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): QLinear(in_features=768, out_features=50257, bias=False)
)
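As a quick sanity check (a minimal sketch assuming the cells above ran successfully), the frozen quantized model can be used for generation just like the original:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
inputs = tokenizer('Quantization reduces', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))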