Quantization

Compressing a model's weights to lower numeric precision (e.g. FP8, INT4) so it runs faster and cheaper — usually with a small quality cost.

Models are typically trained with 16-bit weights (BF16/FP16). Quantization stores those weights at lower precision: FP8 (8 bits per weight) roughly halves weight memory relative to 16-bit, and INT4 (4 bits per weight) roughly quarters it, letting providers serve the same model on less hardware, faster and cheaper.
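
The memory arithmetic can be sketched in a few lines. This is a rough, weights-only estimate: the 70B parameter count is a hypothetical example, `weight_memory_gb` is an illustrative helper name, and real deployments also need memory for the KV cache, activations, and any per-group scale factors that quantized formats store alongside the weights.

```python
# Bytes needed to store one weight at each precision tag.
# Weights only: ignores KV cache, activations, and quantization metadata.
BYTES_PER_WEIGHT = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter model
    for tag in ("bf16", "fp8", "int4"):
        print(f"{tag}: {weight_memory_gb(params, tag):.0f} GB")
```

Under these assumptions the same model needs half the weight memory at FP8 and a quarter at INT4, which is where the hardware savings come from.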

The catch is quality: FP8 is usually near-lossless, while aggressive 4-bit quantization can measurably hurt reasoning and code generation. Two providers serving "the same" open-weight model may behave differently because one runs BF16 and the other INT4.
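
To see why fewer bits cost accuracy, here is a minimal sketch that rounds synthetic weights to a signed integer grid and measures the round-trip error. It uses simple symmetric per-tensor integer quantization as a stand-in; FP8 is a floating-point format, and production 4-bit schemes typically use per-group scales and calibration, so treat the numbers as illustrative only. The weight distribution and the function names are assumptions made for the example.

```python
import random


def quantize_roundtrip(weights, bits):
    """Symmetric per-tensor integer quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = []
    for w in weights:
        q = round(w / scale)                # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))        # clamp to the representable range
        restored.append(q * scale)          # map back to a float
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


random.seed(0)
# Synthetic weights, assumed roughly Gaussian for illustration.
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]

err8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
```

With only 15 usable levels instead of 255, the 4-bit grid is much coarser, so each weight lands further from its original value on average. How much that rounding error affects model output depends on the model and the quantization method.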

When a provider lists a quantization tag (bf16, fp8, int4), factor it into price comparisons: the cheapest endpoint is sometimes cheap because it serves a more heavily quantized version of the model.
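
One way to apply this is to compare prices within each precision tag rather than across all endpoints. The sketch below assumes a hand-built list of listings; the provider names and prices are hypothetical placeholders, not real data, and `cheapest_per_precision` is an illustrative helper.

```python
from collections import defaultdict

# Hypothetical listings: (provider, quantization tag, USD per 1M tokens).
endpoints = [
    ("provider-a", "bf16", 0.90),
    ("provider-b", "fp8", 0.60),
    ("provider-c", "int4", 0.35),
    ("provider-d", "fp8", 0.55),
]


def cheapest_per_precision(listings):
    """Return the lowest-priced (price, provider) pair for each tag."""
    by_tag = defaultdict(list)
    for provider, tag, price in listings:
        by_tag[tag].append((price, provider))
    return {tag: min(offers) for tag, offers in by_tag.items()}


if __name__ == "__main__":
    for tag, (price, provider) in cheapest_per_precision(endpoints).items():
        print(f"{tag}: {provider} at ${price:.2f} per 1M tokens")
```

Grouping this way keeps the globally cheapest INT4 endpoint from being compared directly against BF16 offerings of the same model.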
