
What is model quantization and when should you use it?

Curated by Kory White · Fractional CRO, CRO Syndicate

Model quantization is the process of reducing the numeric precision of a neural network's weights — and sometimes its activations and KV cache — from 16-bit or 32-bit floating point down to lower-precision formats like 8-bit, 4-bit, or even lower. Lower precision means each number takes fewer bits, so the model uses less memory, moves less data, and often runs faster.

You should use quantization when you need to fit a model on cheaper or smaller hardware, increase inference throughput, cut serving cost, or deploy to edge and on-device environments — accepting a usually small, task-dependent drop in accuracy. For most production inference, quantization is one of the highest-leverage optimizations available.

What quantization actually does

A model's weights are normally stored as 16-bit (FP16/BF16) or 32-bit (FP32) floating-point numbers. Quantization maps those values to a lower-precision representation — for example, 8-bit integers (INT8) or 4-bit (INT4) — using a scale (and sometimes a zero-point) to convert back and forth.
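To make the scale and zero-point idea concrete, here is a minimal sketch of 8-bit affine quantization in plain Python. The function names and the sample weights are illustrative assumptions, not any particular library's API; real engines do this per tensor, per channel, or per group with optimized kernels.

```python
def quantize_params(values, num_bits=8):
    """Pick a scale and zero-point that map [min, max] onto unsigned integers."""
    qmax = 2 ** num_bits - 1
    # The range is widened to include 0.0 so that zero is represented exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Float -> integer codes, clamped to the representable range."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Integer codes -> approximate floats."""
    return [(q - zero_point) * scale for q in codes]


weights = [-0.62, -0.1, 0.0, 0.27, 0.91]  # hypothetical weights
scale, zero_point = quantize_params(weights)
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

The round trip is lossy: each restored value can differ from the original by up to about half the scale, which is the precision loss the rest of this article is about.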

The model gets smaller in proportion to the bit reduction: moving from 16-bit to 4-bit roughly quarters the memory footprint of the weights.
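The arithmetic is simple enough to sketch. The example below assumes a hypothetical 7-billion-parameter model and counts weight storage only; it ignores the small overhead of stored scales and zero-points, as well as activations and the KV cache.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


num_params = 7e9  # hypothetical model size, chosen for illustration
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{name:>10}: {weight_memory_gb(num_params, bits):5.1f} GB")
```

Under these assumptions the weights take about 14 GB at 16-bit and about 3.5 GB at 4-bit, which is the difference between needing a data-center GPU and fitting on a much smaller card.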

This matters because memory is the dominant constraint for large models. A model that requires an expensive high-memory GPU at full precision may fit comfortably on a cheaper GPU once quantized, and because there is less data to move between memory and compute, inference frequently speeds up too.

The trade-off is a small loss of numeric precision, which can slightly reduce accuracy.

flowchart LR
    A[FP16 / FP32 model] --> B[Quantization]
    B --> C[INT8 / INT4 / FP8 weights]
    C --> D[Smaller memory footprint]
    C --> E[Faster data movement]
    C --> F[Fits cheaper / smaller GPU]
    C --> G[Small accuracy trade-off]

The main types of quantization

Post-training quantization (PTQ) quantizes an already-trained model without retraining. It is fast and simple, and for many models the accuracy loss at 8-bit is negligible. Popular PTQ methods for LLMs include GPTQ and AWQ, which use calibration data to choose quantization parameters that preserve quality better than naive rounding.
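To illustrate why calibration data helps, here is a deliberately simple stand-in. It is not GPTQ or AWQ: it only searches for a clipping range that minimizes quantization error on calibration samples, whereas naive rounding stretches the range to cover the largest outlier. All names and the synthetic data are assumptions made for the example.

```python
import random


def fake_quant(values, clip, num_bits):
    """Symmetric quantize-then-dequantize over the range [-clip, clip]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def calibrate_clip(samples, num_bits, steps=50):
    """Try clip thresholds up to max |x| and keep the one with the lowest error."""
    max_abs = max(abs(v) for v in samples)
    candidates = [max_abs * i / steps for i in range(1, steps + 1)]
    return min(candidates, key=lambda c: mse(samples, fake_quant(samples, c, num_bits)))


random.seed(0)
# Synthetic calibration set: mostly small values plus one large outlier.
calibration = [random.gauss(0.0, 1.0) for _ in range(2000)] + [25.0]

naive_clip = max(abs(v) for v in calibration)
best_clip = calibrate_clip(calibration, num_bits=4)
naive_err = mse(calibration, fake_quant(calibration, naive_clip, 4))
best_err = mse(calibration, fake_quant(calibration, best_clip, 4))
print(f"naive clip={naive_clip:.2f} mse={naive_err:.4f}")
print(f"calibrated clip={best_clip:.2f} mse={best_err:.4f}")
```

The calibrated range sacrifices the outlier to keep fine resolution where most values live. GPTQ and AWQ are considerably more sophisticated, but they share the idea of using sample data to decide where precision matters.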

Quantization-aware training (QAT) simulates quantization during training or fine-tuning so the model learns to be robust to lower precision. It costs more (you must train) but recovers more accuracy, which matters at very low bit-widths like 4-bit or below.
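The core mechanic of QAT can be sketched with a single weight. The forward pass uses the quantized weight, while the gradient is applied to a full-precision "latent" copy as if rounding were not there (a straight-through estimator). This is a toy under stated assumptions: the data, scale, learning rate, and target value are all made up, and real QAT is done through a training framework rather than by hand.

```python
def fake_quant(w, scale, num_bits=4):
    """Round a weight to the nearest point on a signed integer grid."""
    qmax = 2 ** (num_bits - 1) - 1
    return max(-qmax - 1, min(qmax, round(w / scale))) * scale


# Toy regression data from y = 0.75 * x (0.75 lies on the 4-bit grid below).
data = [(x, 0.75 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]
scale = 0.25   # assumed fixed quantization step
w = 0.0        # latent full-precision weight that the optimizer updates
lr = 0.05

for _ in range(200):
    w_q = fake_quant(w, scale)  # forward pass sees the quantized weight
    grad = sum(2 * (w_q * x - y) * x for x, y in data) / len(data)
    w -= lr * grad              # straight-through: treat d(w_q)/d(w) as 1

w_q = fake_quant(w, scale)
loss = sum((w_q * x - y) ** 2 for x, y in data) / len(data)
print(f"latent w={w:.4f}  quantized w={w_q}  loss={loss}")
```

Note that the latent weight stops moving as soon as its quantized value fits the data; what the model learns is a weight that works after rounding, not before it.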

Weight-only vs. weight-and-activation: weight-only quantization (common for LLMs, e.g., GPTQ/AWQ at INT4) shrinks the large weight matrices while keeping activations in higher precision. Full INT8 quantization of both weights and activations is common for smaller models and for accelerators with fast integer arithmetic.
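A rough sketch of what weight-only INT4 storage looks like: weights are quantized in small groups, each with its own scale, and two 4-bit codes are packed into every byte. Activations stay in floating point and the weights are dequantized just before use. The group size, packing order, and function names here are assumptions for illustration; actual GPTQ and AWQ checkpoints use their own layouts.

```python
def quantize_group_int4(group):
    """Symmetric 4-bit quantization of one group: codes in [-8, 7] plus a scale."""
    max_abs = max(abs(v) for v in group) or 1.0
    scale = max_abs / 7
    codes = [max(-8, min(7, round(v / scale))) for v in group]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes into bytes, low nibble first."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((padded[i] & 0xF) | ((padded[i + 1] & 0xF) << 4)
                 for i in range(0, len(padded), 2))


def unpack_int4(packed, count):
    """Recover signed 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


weights = [0.12, -0.70, 0.33, 0.05, 2.10, -1.40, 0.00, 0.90]  # hypothetical row
group_size = 4
packed_groups = []
for start in range(0, len(weights), group_size):
    codes, scale = quantize_group_int4(weights[start:start + group_size])
    packed_groups.append((pack_int4(codes), scale))

# At inference time: dequantize weights, keep activations in float.
activations = [1.0, 0.5, -2.0, 0.25, 0.1, 0.1, 3.0, -1.0]
restored = []
for packed, scale in packed_groups:
    restored += [c * scale for c in unpack_int4(packed, group_size)]
output = sum(w * a for w, a in zip(restored, activations))
print(restored, output)
```

Per-group scales are why the second group, which contains larger values, does not force a coarse grid onto the first.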

KV-cache quantization separately compresses the attention cache, which is a big memory consumer during long-context generation.
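A back-of-the-envelope estimate shows why. The sketch below assumes a standard attention layout that stores one key and one value vector per layer, per KV head, per token; the model dimensions are hypothetical and chosen only to make the numbers easy to follow.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bits):
    """Approximate KV-cache size in gigabytes (10^9 bytes)."""
    # Factor of 2: one tensor for keys and one for values.
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elements * bits / 8 / 1e9


# Hypothetical model: 32 layers, 32 KV heads of dimension 128.
config = dict(layers=32, kv_heads=32, head_dim=128, seq_len=32_768, batch=1)
for name, bits in [("FP16", 16), ("INT8/FP8", 8), ("INT4", 4)]:
    print(f"{name:>8}: {kv_cache_gb(bits=bits, **config):5.2f} GB")
```

With these assumed dimensions, a single 32K-token sequence needs roughly 17 GB of cache at 16-bit, and the figure scales linearly with batch size, so halving the cache precision directly doubles how many tokens fit.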

FP8 is a newer 8-bit floating-point format supported on recent NVIDIA hardware (the Hopper and Ada Lovelace generations onward) that preserves more dynamic range than INT8. Engines like TensorRT-LLM use it for high-quality, high-speed inference.
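The dynamic-range claim can be checked with a little arithmetic. The article does not name specific FP8 variants, so this sketch assumes the two commonly used ones, E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits), and computes their largest finite values for comparison with INT8.

```python
def fp8_max(exp_bits, man_bits, reserves_top_exponent):
    """Largest finite value of a small floating-point format."""
    bias = 2 ** (exp_bits - 1) - 1
    if reserves_top_exponent:
        # IEEE-style (E5M2): the all-ones exponent encodes infinity and NaN.
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # E4M3: only the all-ones mantissa at the top exponent is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** -(man_bits - 1)
    return max_mantissa * 2 ** max_exp


e4m3_max = fp8_max(4, 3, reserves_top_exponent=False)
e5m2_max = fp8_max(5, 2, reserves_top_exponent=True)
int8_max = 2 ** 7 - 1
print(f"INT8 max: {int8_max}, E4M3 max: {e4m3_max}, E5M2 max: {e5m2_max}")
```

INT8 spreads its 256 codes evenly, while FP8 spends bits on an exponent, so it can represent both very small and fairly large values at the cost of coarser steps between large numbers. That shape tends to suit weights and activations with occasional outliers.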

flowchart TD
    A[Choose quantization approach] --> B{Can you retrain?}
    B -- No --> C[Post-training quantization]
    C --> D[INT8: usually negligible loss]
    C --> E[INT4 via GPTQ / AWQ: small loss]
    B -- Yes --> F[Quantization-aware training]
    F --> G[Best accuracy at very low bits]
    A --> H{NVIDIA recent GPU?}
    H -- Yes --> I[FP8 for quality + speed]

When you should use quantization

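Reach for quantization when any of these apply: the model does not fit on the hardware you can afford, you need more inference throughput, you want to cut serving cost, or you are deploying to edge or on-device environments.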

When to be cautious

Quantization is not free. At aggressive bit-widths (4-bit and below) or on tasks sensitive to precision — complex reasoning, math, code, or specialized domains — accuracy can drop more than expected. Always evaluate the quantized model on your own task with a representative test set, comparing it against the full-precision baseline; perplexity alone can hide real degradation.
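A side-by-side evaluation does not need much machinery. The sketch below assumes you can wrap each model in a `predict` function and that exact-match accuracy is a sensible metric for your task; the two stand-in predictors, the test set, and the `max_drop` threshold are hypothetical placeholders.

```python
def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the label."""
    return sum(predict(x) == label for x, label in examples) / len(examples)


def compare(baseline_predict, quantized_predict, examples, max_drop=0.01):
    """Score both models on the same test set and report the accuracy drop."""
    base = accuracy(baseline_predict, examples)
    quant = accuracy(quantized_predict, examples)
    return {
        "baseline": base,
        "quantized": quant,
        "drop": base - quant,
        "acceptable": base - quant <= max_drop,
    }


# Stand-ins for real models: the "quantized" one has a slightly shifted boundary.
def baseline_predict(x):
    return x > 0.0


def quantized_predict(x):
    return x > 0.15


test_set = [(-1.0, False), (-0.2, False), (0.1, True), (0.2, True),
            (0.5, True), (0.8, True), (1.5, True), (-0.6, False)]
report = compare(baseline_predict, quantized_predict, test_set)
print(report)
```

The failures that matter are the borderline cases, which is why an aggregate metric computed on unrelated text can look fine while your own task regresses.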

If quality suffers, options include using a less aggressive bit-width (INT8 instead of INT4), a better PTQ method (AWQ often preserves quality well), or quantization-aware training to recover accuracy.

Also confirm hardware and engine support: not every GPU and inference server supports every format. FP8 needs recent NVIDIA hardware; specific INT4 kernels depend on your serving engine. Match the quantization format to what vLLM, TensorRT-LLM, TGI, or llama.cpp can actually run efficiently on your hardware.

A practical workflow

  1. Baseline. Measure full-precision accuracy on your task and current serving cost/latency.
  2. Start at INT8 (or FP8 on supported hardware). This usually gives meaningful memory and cost savings with negligible quality loss.
  3. Evaluate. Compare against baseline on your real test set, not just perplexity.
  4. Push to INT4 if needed. Use GPTQ or AWQ with calibration data, then re-evaluate.
  5. If quality drops too far, back off a bit-width or apply quantization-aware fine-tuning.
  6. Deploy on a supporting engine (vLLM, TensorRT-LLM, TGI, or llama.cpp) and confirm the throughput and memory gains in production.
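The selection logic in the steps above can be sketched as a small loop. It assumes an `evaluate` function that returns a task score for a given format; the format names, the scores, and the tolerated drop are hypothetical, and the quantization and deployment steps themselves are left to your tooling.

```python
def choose_precision(evaluate, candidates=("int8", "int4"), max_drop=0.01):
    """Walk from least to most aggressive format; keep the last acceptable one."""
    baseline = evaluate("fp16")
    chosen = "fp16"
    for fmt in candidates:
        if baseline - evaluate(fmt) <= max_drop:
            chosen = fmt   # quality held up, so try the next, smaller format
        else:
            break          # quality dropped too far: back off to the previous one
    return chosen


# Hypothetical scores measured on a real test set.
scores = {"fp16": 0.900, "int8": 0.895, "int4": 0.860}
selected = choose_precision(scores.get)
print(selected)
```

With these made-up scores the loop settles on INT8, because the INT4 result falls outside the tolerated drop. If INT4 were still required, that is the point where quantization-aware fine-tuning becomes worth its cost.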

Frequently Asked Questions

Does quantization always reduce accuracy? There is usually some loss, but at 8-bit it is often negligible and at 4-bit it is small for many tasks with good methods like GPTQ or AWQ. The impact is task-dependent, so always evaluate on your own data rather than assuming.

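What is the difference between GPTQ and AWQ? Both are post-training quantization methods for LLMs that use calibration data to preserve quality at low bit-widths like INT4. GPTQ quantizes weights layer by layer and adjusts the remaining weights to compensate for rounding error. AWQ (activation-aware weight quantization) uses activation statistics to identify and protect the weight channels that matter most, and often retains accuracy especially well. The best choice depends on your model and engine support; test both.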

What is FP8 and why does it matter? FP8 is an 8-bit floating-point format on recent NVIDIA GPUs that keeps more dynamic range than INT8, enabling high-quality, high-speed inference. Engines like TensorRT-LLM use it to get strong performance with minimal accuracy loss.

Should I quantize the KV cache too? For long-context generation, yes — the KV cache is a major memory consumer, and quantizing it (supported by engines like LMDeploy and others) lets you serve longer sequences or more concurrent requests. Evaluate quality impact as you would for weights.

Can I quantize a model and run it anywhere? Not always. The format must be supported by your hardware and inference engine. INT4 GGUF runs widely via llama.cpp; FP8 needs recent NVIDIA GPUs; specific kernels depend on vLLM, TensorRT-LLM, or TGI support. Confirm compatibility before committing.

Is quantization or using a smaller model better? It depends. A quantized large model preserves more capability for hard tasks; a smaller full-precision model may be cheaper and faster for simple tasks. Many teams test both, and use model routing to send each request to the most cost-effective option.
