How do you reduce GPU costs when serving large language models?

You reduce GPU serving costs by increasing how many tokens each GPU produces per dollar, and by avoiding model calls you do not need. The highest-impact levers are an efficient inference server with continuous batching (vLLM, TensorRT-LLM), quantization to fit bigger models on cheaper GPUs, caching (exact and semantic) to skip repeated work, right-sizing the model to the task, and autoscaling with spot capacity so you stop paying for idle GPUs.
These techniques stack: a team that has never optimized its serving setup can usually cut cost substantially by combining several of them.
Start by measuring utilization
You cannot cut what you have not measured. The first step is to look at GPU utilization and tokens per second per GPU for your current deployment. Many self-hosted LLM services run their GPUs at low utilization because requests are processed one at a time or in small static batches, leaving the expensive accelerator mostly idle.
Tools like nvidia-smi, DCGM, and your inference server's metrics expose utilization, batch sizes, and queue depth.
If utilization is low, the problem is usually your serving software, not your hardware. Fixing batching alone often increases throughput several times over without buying any more GPUs.
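As a rough illustration of why throughput per GPU is the number to watch, the sketch below converts it into cost per million tokens. The function name, the hourly price, and the throughput figures are hypothetical placeholders, not measurements from any real deployment.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Dollars spent per one million generated tokens on a single GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Hypothetical numbers: same GPU price, before and after fixing batching.
before = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=200)
after = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1000)
print(f"before: ${before:.2f}/M tokens, after: ${after:.2f}/M tokens")
```

Because the GPU price is fixed per hour, cost per token falls in direct proportion to throughput: five times the tokens per second means one fifth the cost per token.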
Use an efficient inference server with continuous batching
The single biggest software lever is switching to an inference server that does continuous (in-flight) batching and efficient KV-cache management. Servers like vLLM (with PagedAttention), NVIDIA TensorRT-LLM, SGLang (with prefix-cache reuse), and TGI keep the GPU saturated by adding and removing requests from a running batch at the token level.
This packs many concurrent users onto one GPU, directly raising tokens-per-dollar.
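The effect can be seen in a toy simulation. This is a simplified model written for illustration, not how vLLM or any other server is implemented: it counts decode steps only, ignores prefill and memory limits, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Fill free slots before every decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decode step emits one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens) for eight requests, batch size four.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, 4))      # 200
print(continuous_batching_steps(lengths, 4))  # 110
```

In the static case, three slots sit idle for 90 steps in each batch while the long request finishes. In the continuous case, those slots are refilled as soon as the short requests complete, so the same work takes far fewer GPU steps.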
Quantize to fit cheaper hardware
Quantization reduces the numeric precision of model weights (and sometimes the KV cache) from 16-bit down to 8-bit, 4-bit, or even lower. This shrinks the memory footprint so a model that needed an expensive high-memory GPU can run on a cheaper one. It often increases throughput as well, because token generation is typically limited by memory bandwidth and smaller weights mean less data to move per token.
Common approaches include GPTQ, AWQ, and the GGUF quantized formats used by llama.cpp, plus FP8 on supported NVIDIA hardware.
Quantization trades a small amount of quality for large cost savings. For many production tasks the quality drop at 8-bit is negligible and at 4-bit is acceptable. Always evaluate the quantized model on your own task before deploying: measure accuracy on a representative test set, not just perplexity.
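The memory side of the trade is simple arithmetic. The sketch below estimates weight memory only; it assumes a hypothetical 70-billion-parameter model and ignores the KV cache, activations, and the small overhead of quantization scales.

```python
def weight_memory_gb(num_params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes.
    return num_params_billion * bytes_per_weight


for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB of weights")
```

Under these assumptions the weights drop from about 140 GB at 16-bit to about 35 GB at 4-bit, which is the difference between needing several high-memory GPUs and fitting on one.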

Right-size the model to the task
Teams routinely serve a large flagship model for tasks a much smaller model handles well. Classification, extraction, routing, and simple summarization often run fine on a small open model that costs a fraction to serve. A practical pattern is model routing or cascading: send each request to the smallest model likely to succeed, and escalate to a larger model only when needed.
A small model serving 80% of traffic and a large model handling the hard 20% can cut serving cost dramatically versus sending everything to the large model. An AI gateway or router (LiteLLM, Portkey, or a custom classifier) makes this routing transparent to the application.
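A minimal sketch of both ideas follows. The relative costs (1 unit for the small model, 10 for the large one), the confidence threshold, and the stub models are all hypothetical; a real router would call actual model endpoints and use a real confidence signal.

```python
from typing import Callable, Tuple


def blended_cost(small_share: float, small_cost: float, large_cost: float) -> float:
    """Average cost per request when small_share of traffic stays on the small model."""
    return small_share * small_cost + (1 - small_share) * large_cost


def cascade(prompt: str,
            small_model: Callable[[str], Tuple[str, float]],
            large_model: Callable[[str], str],
            min_confidence: float = 0.8) -> str:
    """Try the small model first; escalate when its confidence is too low."""
    answer, confidence = small_model(prompt)
    if confidence >= min_confidence:
        return answer
    return large_model(prompt)


# Stub models for illustration only: short prompts count as "easy".
def stub_small(prompt: str) -> Tuple[str, float]:
    return ("small answer", 0.9 if len(prompt) < 40 else 0.3)


def stub_large(prompt: str) -> str:
    return "large answer"


print(blended_cost(0.8, small_cost=1.0, large_cost=10.0))  # roughly 2.8 vs 10.0
print(cascade("Classify this ticket", stub_small, stub_large))
```

With these made-up costs, an 80/20 split averages about 2.8 units per request instead of 10. Note that `blended_cost` models up-front routing; in a true cascade the escalated requests pay for both models, so the saving is somewhat smaller.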
Cache aggressively
Many production prompts repeat or closely resemble earlier ones. Two layers of caching help:
- Exact-match cache returns a stored response when an identical prompt arrives, at effectively no GPU cost and with near-instant latency.
- Semantic cache uses embedding similarity to return a stored answer for a rephrased but equivalent question, catching far more hits than exact matching at the risk of occasional false matches if the similarity threshold is too loose.
Caching eliminates the GPU work entirely for cached requests, cutting both cost and latency. For RAG and agent workloads with fixed system prompts, prefix caching (reusing the KV cache for the shared instruction prefix) saves recomputation on every request; SGLang and vLLM support this.
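The two response-cache layers can be sketched in a few lines. This is an illustrative toy, not a production design: `toy_embed` is a stand-in bag-of-words embedding (a real system would use an embedding model and a vector index), and the class name and the 0.8 threshold are assumptions.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class TwoLayerCache:
    def __init__(self, threshold: float = 0.8):
        self.exact = {}    # prompt -> response
        self.entries = []  # (embedding, response) pairs for the semantic layer
        self.threshold = threshold

    def get(self, prompt: str):
        if prompt in self.exact:  # layer 1: exact match
            return self.exact[prompt]
        query = toy_embed(prompt)  # layer 2: most similar stored prompt
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None  # miss: the caller runs the model and calls put()

    def put(self, prompt: str, response: str) -> None:
        self.exact[prompt] = response
        self.entries.append((toy_embed(prompt), response))


cache = TwoLayerCache()
cache.put("what is your refund policy", "Refunds within 30 days.")
print(cache.get("what is your refund policy"))         # exact hit
print(cache.get("what is your refund policy please"))  # semantic hit
print(cache.get("how do I reset my password"))         # None (miss)
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the semantic layer rarely hits.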
Improve throughput with the right batching and parallelism
Beyond turning on continuous batching, tune it. Larger batch sizes and longer scheduling windows raise throughput at some cost to latency, so set them to your service-level target rather than defaults. For large models, tensor parallelism splits the model across GPUs so that it fits in memory while keeping per-GPU work efficient.
Match parallelism to the model size so you neither waste GPUs nor bottleneck on communication.
Also separate prefill (processing the prompt) from decode (generating tokens) where your server supports disaggregated serving, since these phases have different resource profiles and packing them well improves overall utilization.
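One way to reason about matching parallelism to model size is to pick the smallest GPU count that fits the weights while leaving room for the KV cache. The sketch below is a rough heuristic under stated assumptions: the 30% KV-cache headroom, the power-of-two degrees, and the function name are illustrative choices, and real servers add constraints of their own (for example, the degree usually has to divide the number of attention heads).

```python
def min_tensor_parallel_degree(weights_gb: float, gpu_memory_gb: float,
                               kv_cache_fraction: float = 0.3,
                               max_degree: int = 8) -> int:
    """Smallest power-of-two GPU count whose weight shard leaves KV-cache headroom."""
    usable_gb = gpu_memory_gb * (1 - kv_cache_fraction)
    degree = 1
    while degree <= max_degree:
        if weights_gb / degree <= usable_gb:
            return degree
        degree *= 2
    raise ValueError("model does not fit within max_degree GPUs")


# Hypothetical: 140 GB of 16-bit weights vs 35 GB of 4-bit weights on 80 GB GPUs.
print(min_tensor_parallel_degree(140, 80))  # 4
print(min_tensor_parallel_degree(35, 80))   # 1
```

Going beyond the minimum degree adds inter-GPU communication on every layer, so extra GPUs are better spent on additional replicas unless latency targets require otherwise.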
Stop paying for idle GPUs
GPUs cost money whether or not they are working. Autoscaling replicas down during low traffic and up during peaks ensures you pay only for capacity you use; cluster layers like Ray Serve or Kubernetes with GPU-aware autoscaling handle this. For batch or fault-tolerant workloads, spot/interruptible GPU capacity from providers like RunPod, Vast.ai, or hyperscaler spot pools costs far less than on-demand. Combine it with checkpointing and graceful draining so an interruption is harmless.
Finally, consider scale-to-zero for low-traffic models using serverless GPU endpoints, accepting a cold-start penalty in exchange for paying nothing when idle.
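The core of an autoscaling policy is a small calculation like the one below. This is a simplified sketch, not the algorithm used by Ray Serve or Kubernetes: the function name, the 70% target utilization, and the capacity figures are assumptions, and real autoscalers add smoothing and cooldown periods to avoid flapping.

```python
import math


def desired_replicas(requests_per_second: float, capacity_per_replica: float,
                     min_replicas: int = 0, max_replicas: int = 8,
                     target_utilization: float = 0.7) -> int:
    """Replica count that keeps each replica near target_utilization of capacity."""
    if requests_per_second <= 0:
        return min_replicas  # min_replicas=0 gives scale-to-zero
    needed = math.ceil(requests_per_second / (capacity_per_replica * target_utilization))
    return max(min_replicas, min(max_replicas, needed))


# Hypothetical replica capacity of 10 requests per second.
for load in (0, 5, 50, 200):
    print(load, "req/s ->", desired_replicas(load, capacity_per_replica=10))
```

Setting `min_replicas` to 1 instead of 0 trades a constant baseline cost for the elimination of cold starts, which is usually the right choice for interactive traffic.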
Frequently Asked Questions
What gives the biggest cost reduction fastest? Switching to a continuous-batching inference server like vLLM or TensorRT-LLM. If your GPUs run at low utilization today, batching alone can multiply throughput several times without new hardware, immediately lowering cost per token.
Does quantization hurt quality? A little, and usually not enough to matter. At 8-bit the quality loss is typically negligible; at 4-bit it is small for many tasks. Always validate the quantized model on your own evaluation set before deploying, since the impact is task-dependent.
How much can caching save? It depends on how repetitive your traffic is. Workloads with common questions or fixed system prompts can serve a meaningful share of requests from cache, eliminating GPU work entirely for those calls and cutting both cost and latency.
Should I use a smaller model or a quantized large model? Test both. A well-chosen small model is often cheaper and faster than a quantized large one for simple tasks, while a quantized large model preserves capability for hard tasks. Model routing lets you use each where it fits.
Is spot GPU capacity safe for serving? For latency-sensitive interactive serving, use it cautiously with graceful draining and fast failover, since instances can be reclaimed. For batch inference and background jobs with checkpointing, spot capacity is a reliable way to cut cost substantially.
How do I know if my GPUs are underutilized? Check GPU utilization (via nvidia-smi or DCGM) and tokens-per-second-per-GPU from your inference server's metrics. Low utilization under real traffic signals a batching or scheduling problem you can fix in software before spending on hardware.
Sources
- vLLM documentation on PagedAttention and continuous batching
- NVIDIA TensorRT-LLM and FP8 quantization documentation
- SGLang documentation on RadixAttention prefix caching
- GPTQ, AWQ, and GGUF quantization references
- NVIDIA DCGM and nvidia-smi GPU monitoring documentation
- Ray Serve and Kubernetes GPU autoscaling documentation
- RunPod and Vast.ai spot/interruptible GPU pricing documentation
