What is model quantization and when should you use it?
Model quantization is the process of reducing the numeric precision of a neural network's weights — and sometimes its activations and KV cache — from 16-bit or 32-bit floating point down to lower-precision formats like 8-bit, 4-bit, or even lower. Lower precision means each number takes fewer bits, so the model uses less memory, moves less data, and often runs faster.
You should use quantization when you need to fit a model on cheaper or smaller hardware, increase inference throughput, cut serving cost, or deploy to edge and on-device environments — accepting a usually small, task-dependent drop in accuracy. For most production inference, quantization is one of the highest-leverage optimizations available.
What quantization actually does
A model's weights are normally stored as 16-bit (FP16/BF16) or 32-bit (FP32) floating-point numbers. Quantization maps those values to a lower-precision representation — for example, 8-bit integers (INT8) or 4-bit (INT4) — using a scale (and sometimes a zero-point) to convert back and forth.
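As a minimal sketch of the scale and zero-point mapping described above, the following pure-Python example quantizes a short list of floats to unsigned 8-bit integers and converts them back. The function names and sample values are illustrative assumptions; real libraries operate on tensors and typically use per-channel or per-group scales.

```python
from typing import List, Tuple


def quantize_affine(values: List[float], num_bits: int = 8) -> Tuple[List[int], float, int]:
    """Map floats to integers in [0, 2**num_bits - 1] using a scale and zero-point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Include 0.0 in the range so that zero maps exactly onto an integer code.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    codes = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_affine(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Convert integer codes back to approximate float values."""
    return [scale * (c - zero_point) for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.5]  # illustrative values only
codes, scale, zero_point = quantize_affine(weights)
restored = dequantize_affine(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

The round-trip error for each value is bounded by roughly half the scale, which is why a narrower value range or more bits gives a more faithful reconstruction.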
The model gets smaller in proportion to the bit reduction: moving from 16-bit to 4-bit roughly quarters the memory footprint of the weights.
This matters because memory is the dominant constraint for large models. A model that requires an expensive high-memory GPU at full precision may fit comfortably on a cheaper GPU once quantized, and because there is less data to move between memory and compute, inference frequently speeds up too.
The trade is a small loss of numeric precision that can slightly reduce accuracy.
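The memory arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch that counts weights only; the 7-billion-parameter figure is a hypothetical example, not a reference to a specific model.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # hypothetical 7-billion-parameter model

fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
int4_gb = weight_memory_gb(params, 4)

print(f"FP16: {fp16_gb} GB, INT8: {int8_gb} GB, INT4: {int4_gb} GB")
```

In practice the quantized figure is slightly higher because scales (and any zero-points) must be stored too. For example, assuming one 16-bit scale per group of 128 weights, the overhead is 16/128 = 0.125 extra bits per weight. Activations and the KV cache add to the total as well.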
The main types of quantization
Post-training quantization (PTQ) quantizes an already-trained model without retraining. It is fast and simple, and for many models the accuracy loss at 8-bit is negligible. Popular PTQ methods for LLMs include GPTQ and AWQ, which use calibration data to choose quantization parameters that preserve quality better than naive rounding.
Quantization-aware training (QAT) simulates quantization during training or fine-tuning so the model learns to be robust to lower precision. It costs more (you must train) but recovers more accuracy, which matters at very low bit-widths like 4-bit or below.
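One common way to simulate quantization during training is "fake quantization" combined with a straight-through estimator: the forward pass uses the rounded weight, while the gradient is applied to a full-precision latent weight as if rounding were the identity. The toy example below is a sketch of that idea on a single scalar weight, with a hypothetical grid step and made-up data; real QAT applies the same pattern to whole tensors inside a training framework.

```python
from typing import List, Tuple


def fake_quant(w: float, step: float = 0.25) -> float:
    """Quantize then immediately dequantize: snap w to the nearest multiple of step."""
    return round(w / step) * step


def train_with_fake_quant(xs: List[float], ys: List[float],
                          steps: int = 200, lr: float = 0.05) -> Tuple[float, float]:
    w = 0.0  # latent full-precision weight that the optimizer updates
    for _ in range(steps):
        w_q = fake_quant(w)  # the forward pass sees the quantized weight
        # Gradient of the mean squared error with respect to w_q.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(w_q)/d(w) as 1 and update the latent weight.
        w -= lr * grad
    return w, fake_quant(w)


xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.37 * x for x in xs]  # toy data generated by a "true" weight of 0.37
latent_w, deployed_w = train_with_fake_quant(xs, ys)
print(latent_w, deployed_w)
```

The latent weight settles near the boundary between the two grid points closest to the target, and the deployed weight is one of those two grid points. The model learns under the same rounding it will face at inference time.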
Weight-only vs. weight-and-activation quantization: weight-only quantization (common for LLMs, e.g., GPTQ/AWQ at INT4) shrinks the large weight matrices while keeping activations in higher precision. Full INT8 quantization of both weights and activations is common for smaller models and accelerators.
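Weight-only INT4 schemes commonly store one scale per small group of weights rather than one per tensor, so that a few large weights do not wipe out the resolution available to the small ones. The sketch below shows symmetric group-wise rounding in pure Python. It uses plain round-to-nearest, not the GPTQ or AWQ algorithms, and the group size and weight values are illustrative assumptions.

```python
from typing import List, Tuple

QMAX = 7  # symmetric signed 4-bit range used in this sketch: [-7, 7]


def quantize_groupwise_int4(weights: List[float], group_size: int) -> Tuple[List[int], List[float]]:
    """Round-to-nearest 4-bit quantization with one scale per group of weights."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / QMAX or 1.0  # 1.0 for an all-zero group
        scales.append(scale)
        codes.extend(max(-QMAX, min(QMAX, round(w / scale))) for w in group)
    return codes, scales


def dequantize_groupwise(codes: List[int], scales: List[float], group_size: int) -> List[float]:
    """Multiply each code by the scale of the group it belongs to."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def total_abs_error(weights: List[float], group_size: int) -> float:
    codes, scales = quantize_groupwise_int4(weights, group_size)
    restored = dequantize_groupwise(codes, scales, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored))


# Illustrative weights: four small values followed by four large ones.
weights = [0.02, -0.01, 0.03, 0.015, 1.2, -0.8, 0.4, 0.1]

err_grouped = total_abs_error(weights, group_size=4)
err_per_tensor = total_abs_error(weights, group_size=len(weights))
print(err_grouped, err_per_tensor)
```

With a single per-tensor scale, the four small weights all round to zero. With one scale per group of four, they keep most of their value. Activations are untouched throughout, which is what makes this weight-only.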
KV-cache quantization separately compresses the attention cache, which is a big memory consumer during long-context generation.
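To see why the cache matters, the sketch below estimates KV-cache size for a standard attention layout that stores one key tensor and one value tensor per layer. The model dimensions are hypothetical, chosen only to make the arithmetic easy to follow, and the estimate ignores any scales stored alongside quantized values.

```python
GIB = 2 ** 30


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bits_per_value: int) -> int:
    """Estimate KV-cache size in bytes for a given precision."""
    # Factor of 2: one key tensor and one value tensor per layer.
    total_values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return total_values * bits_per_value // 8


# Hypothetical configuration: 32 layers, 8 KV heads of dimension 128, one 32,768-token sequence.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768, batch_size=1)

fp16_cache = kv_cache_bytes(**config, bits_per_value=16)
int8_cache = kv_cache_bytes(**config, bits_per_value=8)
int4_cache = kv_cache_bytes(**config, bits_per_value=4)

print(fp16_cache / GIB, int8_cache / GIB, int4_cache / GIB)
```

Because the cache grows linearly with both sequence length and batch size, halving the bits per value roughly doubles the context length or the number of concurrent requests that fit in the same memory.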
FP8 is a newer 8-bit floating-point format supported on recent NVIDIA hardware. It preserves more dynamic range than INT8, and engines like TensorRT-LLM use it for high-quality, high-speed inference.

When you should use quantization
Reach for quantization when any of these apply:
- You need to fit a bigger model on smaller hardware. Quantizing to INT4 can let a large model run on a single mid-range GPU instead of multiple high-end ones.
- You want lower serving cost. Serving the same traffic on fewer or cheaper GPUs directly reduces the bill.
- You need higher throughput. Reduced data movement and support for low-precision compute can raise tokens per second.
- You are deploying to the edge or on-device. Phones, laptops, and embedded devices have tight memory; quantized GGUF models via llama.cpp make local inference practical.
- You serve long contexts. KV-cache quantization reduces the memory that long sequences consume during generation.
When to be cautious
Quantization is not free. At aggressive bit-widths (4-bit and below) or on tasks sensitive to precision — complex reasoning, math, code, or specialized domains — accuracy can drop more than expected. Always evaluate the quantized model on your own task with a representative test set, comparing it against the full-precision baseline; perplexity alone can hide real degradation.
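The comparison against a full-precision baseline can be sketched as a small harness. Everything here is an assumption for illustration: the two predictors are toy stand-ins for real model calls, the test set is synthetic, and the 1-point tolerance is an arbitrary threshold you would set for your own task.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[int, int]  # (input, expected label) for this toy task


def accuracy(predict: Callable[[int], int], examples: List[Example]) -> float:
    """Fraction of examples where the prediction matches the expected label."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)


def compare_to_baseline(baseline: Callable[[int], int], quantized: Callable[[int], int],
                        examples: List[Example], max_drop: float = 0.01) -> Dict[str, object]:
    """Score both models on the same test set and flag an unacceptable accuracy drop."""
    base_acc = accuracy(baseline, examples)
    quant_acc = accuracy(quantized, examples)
    drop = base_acc - quant_acc
    return {"baseline": base_acc, "quantized": quant_acc,
            "drop": drop, "acceptable": drop <= max_drop}


# Toy stand-ins: the "quantized" predictor is wrong on every 25th input.
def baseline_predict(x: int) -> int:
    return x % 3


def quantized_predict(x: int) -> int:
    return (x + 1) % 3 if x % 25 == 0 else x % 3


test_set = [(i, i % 3) for i in range(100)]
report = compare_to_baseline(baseline_predict, quantized_predict, test_set)
print(report)
```

The important design choice is that both models are scored on the same representative examples with the same task metric, so the reported drop reflects the quantization step rather than differences in evaluation.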
If quality suffers, options include using a less aggressive bit-width (INT8 instead of INT4), a better PTQ method (AWQ often preserves quality well), or quantization-aware training to recover accuracy.
Also confirm hardware and engine support: not every GPU and inference server supports every format. FP8 needs recent NVIDIA hardware; specific INT4 kernels depend on your serving engine. Match the quantization format to what vLLM, TensorRT-LLM, TGI, or llama.cpp can actually run efficiently on your hardware.
A practical workflow
- Baseline. Measure full-precision accuracy on your task and current serving cost/latency.
- Start at INT8 (or FP8 on supported hardware). This usually gives meaningful memory and cost savings with negligible quality loss.
- Evaluate. Compare against baseline on your real test set, not just perplexity.
- Push to INT4 if needed. Use GPTQ or AWQ with calibration data, then re-evaluate.
- Recover quality. If quality drops too far, back off a bit-width or apply quantization-aware fine-tuning.
- Deploy on a supporting engine (vLLM, TensorRT-LLM, TGI, or llama.cpp) and confirm the throughput and memory gains in production.
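The selection logic in the steps above can be sketched as a loop that tries formats from least to most aggressive and keeps the last one that stays within a quality tolerance. The format names, the `evaluate` callback, and the accuracy numbers are hypothetical placeholders, not measured results.

```python
from typing import Callable, List


def choose_precision(baseline_accuracy: float, candidates: List[str],
                     evaluate: Callable[[str], float], max_drop: float = 0.01) -> str:
    """Return the most aggressive candidate whose accuracy drop stays within max_drop."""
    chosen = "baseline"
    for name in candidates:  # ordered from least to most aggressive
        if baseline_accuracy - evaluate(name) <= max_drop:
            chosen = name
        else:
            break  # stop at the first format that degrades quality too much
    return chosen


# Placeholder numbers standing in for evaluation runs on your own test set.
placeholder_scores = {"int8": 0.912, "int4": 0.890}

selected = choose_precision(
    baseline_accuracy=0.915,
    candidates=["int8", "int4"],
    evaluate=lambda name: placeholder_scores[name],
)
print(selected)
```

In a real pipeline, `evaluate` would quantize the model to the named format and run the task-specific test set. If the loop returns a less aggressive format than you need, that is the point to try a different PTQ method or quantization-aware fine-tuning.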
Frequently Asked Questions
Does quantization always reduce accuracy? There is usually some loss, but at 8-bit it is often negligible and at 4-bit it is small for many tasks with good methods like GPTQ or AWQ. The impact is task-dependent, so always evaluate on your own data rather than assuming.
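What is the difference between GPTQ and AWQ? Both are post-training quantization methods for LLMs that use calibration data to preserve quality at low bit-widths like INT4. GPTQ quantizes weights layer by layer and adjusts the remaining weights to compensate for rounding error, using approximate second-order information. AWQ (activation-aware weight quantization) uses activation statistics to identify the most important weight channels and scales them to protect them from quantization error. Neither is uniformly better; the best choice depends on your model and engine support, so test both.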
What is FP8 and why does it matter? FP8 is an 8-bit floating-point format on recent NVIDIA GPUs that keeps more dynamic range than INT8, enabling high-quality, high-speed inference. Engines like TensorRT-LLM use it to get strong performance with minimal accuracy loss.
Should I quantize the KV cache too? For long-context generation, usually yes. The KV cache is a major memory consumer, and quantizing it (supported by engines such as LMDeploy) lets you serve longer sequences or more concurrent requests. Evaluate the quality impact as you would for weights.
Can I quantize a model and run it anywhere? Not always. The format must be supported by your hardware and inference engine. INT4 GGUF runs widely via llama.cpp; FP8 needs recent NVIDIA GPUs; specific kernels depend on vLLM, TensorRT-LLM, or TGI support. Confirm compatibility before committing.
Is quantization or using a smaller model better? It depends. A quantized large model preserves more capability for hard tasks; a smaller full-precision model may be cheaper and faster for simple tasks. Many teams test both, and use model routing to send each request to the most cost-effective option.
Sources
- GPTQ and AWQ quantization method references
- NVIDIA TensorRT-LLM documentation on FP8 and quantization
- vLLM documentation on supported quantization formats
- Hugging Face documentation on model quantization (bitsandbytes, GPTQ, AWQ)
- llama.cpp and GGUF quantization documentation
- LMDeploy documentation on KV-cache quantization
