How do you measure and improve GPU utilization?

Direct Answer
You measure GPU utilization with profiling and monitoring tools — nvidia-smi and DCGM for live device metrics, and PyTorch Profiler or Nsight for in-job analysis — but the key is to look past the headline "GPU utilization %" to the metrics that actually reflect useful work: streaming-multiprocessor (SM) occupancy, memory bandwidth, Tensor Core usage, and memory consumption.
A GPU can report 100% "utilization" while doing almost no math, because that figure only shows that some kernel was running, not how much of the chip it used. You improve utilization by removing whatever is starving the GPU: feed it faster (better data loading and batching), keep it busy (larger batches, continuous batching for inference, gradient accumulation), use it efficiently (mixed precision, Tensor Cores, the right kernels), and share idle capacity (MIG, time-slicing, multi-tenant scheduling).
The goal is high *useful* throughput per dollar, not just a high number on a dashboard.
What "GPU utilization" really means
The nvidia-smi "GPU-Util" figure is the percentage of time over a sample window during which at least one kernel was running. It says nothing about *how much* of the GPU was used. A trivial kernel that reads one value can keep that number at 100% while the thousands of cores and Tensor Cores sit idle.
That is why teams routinely see a GPU "fully utilized" yet delivering a fraction of its theoretical throughput. To measure real efficiency you need deeper signals:
- SM occupancy / activity: how busy the compute units actually are.
- Tensor Core utilization: whether the high-throughput matrix units (which do most ML math) are being used.
- Memory bandwidth utilization: many ML kernels are memory-bound, so this often gates throughput.
- GPU memory usage: how much VRAM the workload consumes, which constrains batch size and model size.
- Throughput: the business metric — samples/sec in training, tokens/sec in inference.
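One way to make "a fraction of its theoretical throughput" concrete is to divide the achieved arithmetic rate by the hardware's peak rate. The sketch below is only an illustration of that arithmetic, not a profiler: the function name achieved_fraction and all three input numbers are assumptions made up for the example. In practice the work per step would come from a FLOP count of your model, and the peak from the vendor's datasheet for the precision you run.

```python
def achieved_fraction(flops_per_step: float, step_time_s: float,
                      peak_flops_per_s: float) -> float:
    """Fraction of the hardware's peak arithmetic rate that one step achieved."""
    if step_time_s <= 0 or peak_flops_per_s <= 0:
        raise ValueError("step time and peak rate must be positive")
    achieved_flops_per_s = flops_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s


# Hypothetical numbers, chosen only to illustrate the calculation.
fraction = achieved_fraction(
    flops_per_step=4.0e13,     # assumed floating-point work in one training step
    step_time_s=0.5,           # assumed measured wall-clock time for that step
    peak_flops_per_s=3.2e14,   # assumed datasheet peak for the precision in use
)
print(f"{fraction:.0%} of peak")
```

A GPU in this hypothetical case could show 100% in nvidia-smi for the whole step and still be delivering only a quarter of its peak rate.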
How to measure it
Use the right tool for the right altitude:
- nvidia-smi: quick live check of utilization, memory, temperature, and power. Good for spotting an idle or out-of-memory GPU, not for deep analysis.
- NVIDIA DCGM (Data Center GPU Manager): fleet-wide metrics (SM activity, Tensor Core usage, memory bandwidth, power) that you export to Prometheus and visualize in Grafana for continuous monitoring across a cluster.
- PyTorch Profiler / TensorBoard: in-training analysis showing where time goes (compute vs. data loading vs. communication) and flagging GPU stalls.
- NVIDIA Nsight Systems / Nsight Compute: the deepest profilers, for tracing kernels and the CPU-GPU timeline to find bottlenecks and underused Tensor Cores.
- DCGM exporter + Prometheus + Grafana is the standard stack for ongoing cluster observability; for Kubernetes, the GPU Operator wires much of this up automatically.
Establish a baseline, then optimize against it and re-measure — utilization work without measurement is guessing.
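For a quick baseline before setting up DCGM, the nvidia-smi query interface can be scripted. The following is a rough sketch, assuming the CSV layout produced by the --query-gpu and --format=csv,noheader,nounits options; the helper names (parse_gpu_csv, query_gpus, idle_gpus), the 10% idle threshold, and the sample text are all hypothetical. It also assumes every field is numeric, which may not hold on devices that report "[N/A]" for some values.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        index, util, used, total = (field.strip() for field in record)
        rows.append({
            "index": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(used),
            "mem_total_mib": float(total),
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return parsed rows (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def idle_gpus(rows, util_threshold_pct=10.0):
    """Indices of GPUs whose reported utilization is below the threshold."""
    return [row["index"] for row in rows if row["util_pct"] < util_threshold_pct]


# Made-up sample so the parsing logic can be tried without a GPU.
sample = "0, 97, 30212, 40960\n1, 3, 512, 40960\n"
rows = parse_gpu_csv(sample)
print(idle_gpus(rows))
```

Keep in mind that this only captures the coarse GPU-Util figure discussed earlier, so it is useful for spotting idle devices rather than for judging efficiency.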

Why GPUs sit idle
Most low utilization traces to the GPU waiting on something else:
- Data loading (input pipeline) bottleneck: the CPU cannot prepare and feed batches fast enough, so the GPU stalls between steps. This is the single most common cause in training.
- Small batch sizes: batches too small to fill the GPU leave compute units underused.
- CPU-bound preprocessing or Python overhead between GPU operations.
- Communication overhead in distributed training, where GPUs wait on gradient synchronization across nodes.
- Memory limits forcing tiny batches, or memory fragmentation causing out-of-memory and inefficient allocation.
- For inference, low or bursty request volume that leaves the GPU idle between requests, or naive serving that processes one request at a time.
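To check whether the input pipeline is the culprit, a training loop can be instrumented to separate time spent waiting for the next batch from time spent in the step itself. This is a minimal framework-free sketch; measure_data_wait, wait_fraction, and slow_loader are hypothetical names, and the sleeps merely stand in for real data loading and compute. One assumption matters on a real GPU: kernels launch asynchronously, so step_fn would need to synchronize (for example with torch.cuda.synchronize()) before returning, otherwise compute time leaks into the next measurement.

```python
import time


def measure_data_wait(batches, step_fn, clock=time.perf_counter):
    """Return (data_wait_s, step_s) accumulated over one pass through `batches`."""
    data_wait_s = 0.0
    step_s = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)  # time blocked on the input pipeline
        except StopIteration:
            break
        t1 = clock()
        step_fn(batch)              # forward, backward, and optimizer step
        t2 = clock()
        data_wait_s += t1 - t0
        step_s += t2 - t1
    return data_wait_s, step_s


def wait_fraction(data_wait_s, step_s):
    """Share of total loop time spent waiting for data."""
    total = data_wait_s + step_s
    return data_wait_s / total if total > 0 else 0.0


def slow_loader(num_batches, delay_s):
    """Stand-in for a loader that does CPU-side decoding or augmentation."""
    for i in range(num_batches):
        time.sleep(delay_s)
        yield i


wait_s, step_s = measure_data_wait(
    slow_loader(5, 0.02),
    lambda batch: time.sleep(0.005),  # stand-in for the GPU step
)
print(f"waiting on data: {wait_fraction(wait_s, step_s):.0%}")
```

A large wait fraction points at the first item in the list above; a small one suggests looking at batch size, kernels, or communication instead.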
How to improve training utilization
- Fix the input pipeline: use multiple data-loader workers, prefetching, and pinned memory so batches are ready before the GPU finishes the previous step. In PyTorch, tune the DataLoader's num_workers, pin_memory, and prefetching settings; cache or pre-process data to remove per-step CPU work.
- Increase effective batch size: larger batches fill the GPU; if memory limits you, use gradient accumulation to simulate a bigger batch.
- Use mixed precision (AMP): training in FP16/BF16 engages Tensor Cores, often roughly doubling throughput and substantially cutting memory versus FP32, usually with negligible accuracy loss.
- Optimize the model and kernels: fused kernels, torch.compile, FlashAttention, and efficient operators reduce overhead and memory traffic.
- Tune distributed training: overlap computation with gradient communication, use efficient strategies (FSDP, DeepSpeed ZeRO), and ensure interconnect (NVLink/InfiniBand) is not the bottleneck.
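Gradient accumulation is easiest to trust once you see why it "simulates" a bigger batch. The sketch below uses a deliberately tiny one-parameter model in plain Python, not PyTorch, to show that summing micro-batch gradients, each scaled by one over the number of micro-batches, reproduces the full-batch gradient. The function names and data are made up for illustration, and the sketch assumes equal-sized micro-batches and a mean-reduced loss. In a framework the same idea usually appears as dividing the loss by the number of accumulation steps and calling the optimizer only every k-th micro-batch.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_gradient(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients into one full-batch-equivalent gradient."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("this sketch assumes equal-sized micro-batches")
    num_micro = len(xs) // micro_batch_size
    grad = 0.0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        # Scaling by 1/num_micro keeps the sum equal to the full-batch mean.
        grad += mse_gradient(w, xs[start:stop], ys[start:stop]) / num_micro
    return grad


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_gradient(w, xs, ys)
accumulated = accumulated_gradient(w, xs, ys, micro_batch_size=2)
print(abs(full - accumulated) < 1e-9)
```

Only one micro-batch of activations has to fit in memory at a time, which is why the technique helps when VRAM, not compute, caps the batch size.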
How to improve inference utilization
- Continuous (in-flight) batching: servers like vLLM, SGLang, and TensorRT-LLM batch incoming requests dynamically so the GPU processes many sequences together instead of one at a time — the biggest single win for LLM serving utilization.
- Paged attention and KV-cache management (as in vLLM) reduce memory waste and let you serve more concurrent requests per GPU.
- Quantization: 8-bit or 4-bit weights cut memory so you can use bigger batches or smaller, cheaper GPUs.
- Right-size and autoscale: match GPU type to the model, and scale replicas to demand so you are not running large idle GPUs; scale-to-zero for spiky low-traffic models.
- GPU sharing for small models: with Multi-Instance GPU (MIG) or time-slicing, several light workloads share one physical GPU instead of each monopolizing one.
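The gain from continuous batching can be illustrated with a toy scheduler that counts decode steps. This is a simplified simulation under stated assumptions, not how vLLM, SGLang, or TensorRT-LLM are implemented: every request is already queued, each in-flight sequence produces one token per step, every output length is at least 1, and prefill cost and KV-cache limits are ignored. All names and the request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(output_lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for start in range(0, len(output_lengths), slots):
        steps += max(output_lengths[start:start + slots])
    return steps


def continuous_batching_steps(output_lengths, slots):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(output_lengths)
    active = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Every in-flight sequence emits one token; finished ones free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


def slot_utilization(output_lengths, slots, steps):
    """Share of (step, slot) pairs that produced a token."""
    return sum(output_lengths) / (steps * slots)


# Made-up output lengths (tokens per request) and batch width.
lengths = [2, 16, 3, 16, 1, 2, 4, 4]
slots = 4

static_steps = static_batching_steps(lengths, slots)
continuous_steps = continuous_batching_steps(lengths, slots)
print(static_steps, continuous_steps)
print(slot_utilization(lengths, slots, static_steps),
      slot_utilization(lengths, slots, continuous_steps))
```

With these made-up lengths the static schedule needs 20 decode steps and the continuous one 16, so slot utilization rises from 60% to 75%. The gap grows as output lengths become more uneven, because short requests no longer wait for the longest sequence in their batch.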
Pulling it together
Improving GPU utilization is a measure-diagnose-fix loop. First measure the *right* signals — SM occupancy, Tensor Core and memory-bandwidth usage, and throughput — not just the misleading nvidia-smi percentage, using DCGM, PyTorch Profiler, and Nsight. Then diagnose what starves the GPU: usually data loading, small batches, communication, or one-at-a-time inference.
Then fix it with faster data pipelines, larger or accumulated batches, mixed precision and Tensor Cores, continuous batching for serving, and GPU sharing for small jobs. Re-measure after each change. Done well, this routinely doubles or triples useful throughput on the same hardware — the cheapest capacity you will ever add.
Frequently Asked Questions
Why does nvidia-smi show 100% utilization but training is slow? Because GPU-Util only reports whether a kernel was running, not how much of the GPU was used. A data-starved loop can keep a trivial kernel busy at 100% while the compute and Tensor Cores idle. Profile with PyTorch Profiler or DCGM to see SM occupancy and Tensor Core usage, which reveal the real picture.
What's the single most common cause of low GPU utilization in training? The input data pipeline. If CPU-side data loading and preprocessing cannot keep up, the GPU stalls waiting for the next batch. Increasing data-loader workers, enabling prefetching and pinned memory, and pre-processing or caching data usually delivers the largest, easiest gains.
How does mixed precision improve utilization? Training in FP16 or BF16 uses the GPU's Tensor Cores, which are far faster than standard FP32 units for matrix math, and it cuts activation memory so you can run larger batches. Together this often roughly doubles throughput with negligible accuracy impact, which is why automatic mixed precision (AMP) is standard practice.
What improves GPU utilization for LLM inference specifically? Continuous batching combined with paged attention, as implemented in vLLM, SGLang, and TensorRT-LLM. Instead of serving one request at a time, the server batches many in-flight sequences and manages the KV cache efficiently, dramatically raising tokens-per-second per GPU. Quantization and right-sized autoscaling add further gains.
What tools should I use to monitor GPU utilization across a cluster? NVIDIA DCGM with its Prometheus exporter, visualized in Grafana, is the standard for fleet-wide GPU monitoring — SM activity, Tensor Core usage, memory bandwidth, power, and temperature. On Kubernetes, the NVIDIA GPU Operator can deploy DCGM and exporters automatically, and you can alert on idle or saturated GPUs.
Can I share one GPU across multiple workloads? Yes. Multi-Instance GPU (MIG) on supported NVIDIA hardware partitions one physical GPU into isolated instances, and time-slicing lets multiple pods share a GPU sequentially. This raises utilization for many small workloads — like serving several lightweight models — so you buy fewer GPUs overall, at the cost of some isolation and scheduling complexity.
Sources
- NVIDIA: nvidia-smi and DCGM documentation (developer.nvidia.com)
- NVIDIA: Nsight Systems and Nsight Compute profilers (developer.nvidia.com/nsight-systems)
- PyTorch: Profiler and performance tuning guide (pytorch.org)
- vLLM: continuous batching and paged attention (docs.vllm.ai)
- NVIDIA: Multi-Instance GPU (MIG) user guide (docs.nvidia.com)
- NVIDIA: GPU Operator for Kubernetes (docs.nvidia.com)
- Prometheus / Grafana with DCGM exporter (github.com/NVIDIA/dcgm-exporter)
