Quantization, in the context of machine learning models like large language models (LLMs), is a technique for reducing a model's size and computational cost.

This is generally done to allow a model to run on lower-end hardware, e.g. machines with less memory.

The process involves a trade-off: a smaller, faster model in exchange for some loss of accuracy.

Lower Precision

At a lower level, quantization achieves these gains by reducing the precision of the model's parameters. For instance, if a model's parameters are stored as 32-bit floating-point numbers (FP32), they can be converted to 8-bit integers (INT8).
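
As a minimal sketch of how that conversion can work, the snippet below uses a symmetric, per-tensor scheme: a single scale factor maps FP32 values onto the INT8 range. (Real quantizers also offer asymmetric and per-channel variants; the function names here are just for illustration.)

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 values onto [-127, 127] using one scale per tensor (symmetric)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values; the rounding error is the accuracy cost."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print("max round-trip error:", np.abs(weights - dequantize(q, scale)).max())
```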

Integers take up less space in memory (1 byte for INT8 versus 4 bytes for FP32) and can be processed more quickly, since lower-precision operations such as matrix multiplication require fewer computational resources.
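
That 4x saving adds up quickly at LLM scale. A back-of-the-envelope calculation for a hypothetical 7-billion-parameter model:

```python
params = 7_000_000_000          # hypothetical 7B-parameter model

fp32_gb = params * 4 / 1e9      # 4 bytes per FP32 parameter
int8_gb = params * 1 / 1e9      # 1 byte per INT8 parameter

print(f"FP32: {fp32_gb:.0f} GB, INT8: {int8_gb:.0f} GB")  # FP32: 28 GB, INT8: 7 GB
```

At 7 GB, the INT8 weights fit on a single consumer GPU; the 28 GB FP32 version generally does not.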

Specialized Hardware

Modern TPUs and GPUs include dedicated support for low-precision arithmetic (for example, NVIDIA's Tensor Cores accelerate INT8 matrix multiplication), allowing them to run quantized models even faster.
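
On the software side, mainstream frameworks expose quantization directly. For example, PyTorch ships dynamic quantization, which converts the weights of selected layer types to INT8 in one call. A minimal sketch (the toy model is a stand-in for a real network; this particular API targets CPU backends):

```python
import torch
import torch.nn as nn

# Toy stand-in for a real model: two linear layers.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Convert Linear weights to INT8; activations are quantized on the fly at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the original model
```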