
How do you measure and improve GPU utilization?

Curated by Kory White · Fractional CRO, CRO Syndicate

Direct Answer

You measure GPU utilization with profiling and monitoring tools — nvidia-smi and DCGM for live device metrics, and PyTorch Profiler or Nsight for in-job analysis — but the key is to look past the headline "GPU utilization %" to the metrics that actually reflect useful work: streaming-multiprocessor (SM) occupancy, memory bandwidth, Tensor Core usage, and memory consumption.

A GPU can report 100% "utilization" while doing almost no math, because the counter only shows that some kernel was running, even one that is stalled on memory or occupies a handful of cores. You improve utilization by removing whatever is starving the GPU: feed it faster (better data loading and batching), keep it busy (larger batches, continuous batching for inference, gradient accumulation), use it efficiently (mixed precision, Tensor Cores, the right kernels), and share idle capacity (MIG, time-slicing, multi-tenant scheduling).

The goal is high *useful* throughput per dollar, not just a high number on a dashboard.

What "GPU utilization" really means

The nvidia-smi "GPU-Util" figure is the percentage of time over a sample window during which at least one kernel was running. It says nothing about *how much* of the GPU was used. A trivial kernel that reads one value can keep that number at 100% while the thousands of cores and Tensor Cores sit idle.
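To make the distinction concrete, here is a minimal sketch that models the two readings side by side. The `Sample` class, the function names, and the 108-SM device size are assumptions chosen for illustration; this is a toy model of how a time-based counter behaves, not the driver's actual computation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One monitoring sample: how many SMs had work during the interval."""
    busy_sms: int


def gpu_util_percent(samples):
    """Time-based reading: share of samples in which at least one kernel ran."""
    active = sum(1 for s in samples if s.busy_sms > 0)
    return 100.0 * active / len(samples)


def sm_activity_percent(samples, total_sms):
    """Capacity-based reading: average share of SMs that were actually busy."""
    busy = sum(s.busy_sms for s in samples)
    return 100.0 * busy / (len(samples) * total_sms)


TOTAL_SMS = 108  # hypothetical device size, for illustration only

# A trivial kernel that keeps a single SM busy in every sample.
tiny_kernel = [Sample(busy_sms=1) for _ in range(10)]

print(gpu_util_percent(tiny_kernel))                          # 100.0
print(round(sm_activity_percent(tiny_kernel, TOTAL_SMS), 2))  # 0.93
```

The same workload reads as fully utilized by the time-based counter and under 1% busy by the capacity-based one, which is the gap the deeper signals are meant to expose.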

That is why teams routinely see a GPU "fully utilized" yet delivering a fraction of its theoretical throughput. To measure real efficiency you need deeper signals:

flowchart LR
    SMI[nvidia-smi GPU-Util %] -->|misleading| REAL[Real efficiency signals]
    REAL --> SM[SM occupancy]
    REAL --> TC[Tensor Core usage]
    REAL --> BW[Memory bandwidth]
    REAL --> TPUT[Throughput: samples or tokens/sec]

How to measure it

Use the right tool for the right altitude: nvidia-smi and DCGM for live device metrics, and PyTorch Profiler or Nsight for in-job analysis.

Establish a baseline, then optimize against it and re-measure — utilization work without measurement is guessing.
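A baseline can be as simple as a throughput number recorded before any tuning. The helper below is a minimal sketch: `measure_throughput`, `speedup`, and `fake_step` are hypothetical names, and the sleeping step stands in for a real training or inference step. For GPU work the clock should only stop after queued kernels finish, so the optional `sync_fn` hook is where a call such as PyTorch's `torch.cuda.synchronize` would typically go.

```python
import time


def measure_throughput(step_fn, batch_size, warmup=3, iters=10, sync_fn=None):
    """Return samples/sec for step_fn, excluding warm-up iterations."""
    for _ in range(warmup):
        step_fn()  # warm-up: caches, allocator, lazy initialization
    if sync_fn is not None:
        sync_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync_fn is not None:
        sync_fn()  # wait for queued GPU work before stopping the clock
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def speedup(baseline, candidate):
    """Ratio of a new throughput measurement to the recorded baseline."""
    return candidate / baseline


def fake_step():
    time.sleep(0.002)  # stand-in for one real step


baseline = measure_throughput(fake_step, batch_size=32)
print(f"baseline: {baseline:.0f} samples/sec")
```

Recording the baseline once and calling `speedup(baseline, new_value)` after each change keeps the comparison honest.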

Why GPUs sit idle

Most low utilization traces to the GPU waiting on something else:

flowchart TD
    IDLE[GPU underused] --> DATA{Waiting on data?}
    DATA -->|Yes| PIPE[Fix input pipeline / batching]
    DATA -->|No| COMM{Waiting on comms?}
    COMM -->|Yes| DIST[Optimize distributed comms]
    COMM -->|No| KERNEL[Inefficient kernels / precision]
    KERNEL --> AMP[Mixed precision + Tensor Cores]

How to improve training utilization
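
Gradient accumulation, one of the levers named in the direct answer, lets a memory-limited job train with the effective batch size of a larger one. The sketch below is a pure-Python toy (a single weight with a mean-squared-error loss; all names are hypothetical) showing that summing scaled micro-batch gradients reproduces the full-batch gradient when the micro-batches are equally sized. In PyTorch the usual pattern is to divide each micro-batch loss by the number of micro-batches, call `backward()` on it, and call `optimizer.step()` once per accumulated batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-batches.

    Assumes len(xs) is a multiple of micro_batch so every chunk has equal weight.
    """
    chunks = [
        (xs[i:i + micro_batch], ys[i:i + micro_batch])
        for i in range(0, len(xs), micro_batch)
    ]
    total = 0.0
    for cx, cy in chunks:
        total += grad_mse(w, cx, cy) / len(chunks)
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

print(grad_mse(w, xs, ys))                         # -22.5
print(accumulated_grad(w, xs, ys, micro_batch=2))  # -22.5
```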

How to improve inference utilization

Pulling it together

Improving GPU utilization is a measure-diagnose-fix loop. First measure the *right* signals — SM occupancy, Tensor Core and memory-bandwidth usage, and throughput — not just the misleading nvidia-smi percentage, using DCGM, PyTorch Profiler, and Nsight. Then diagnose what starves the GPU: usually data loading, small batches, communication, or one-at-a-time inference.

Then fix it with faster data pipelines, larger or accumulated batches, mixed precision and Tensor Cores, continuous batching for serving, and GPU sharing for small jobs. Re-measure after each change. Done well, this can double or triple useful throughput on the same hardware, which makes it the cheapest capacity you will ever add.

Frequently Asked Questions

Why does nvidia-smi show 100% utilization but training is slow? Because GPU-Util only reports whether a kernel was running, not how much of the GPU was used. A stream of tiny or memory-stalled kernels can hold the counter at 100% while most of the compute units and Tensor Cores idle. Profile with PyTorch Profiler or DCGM to see SM occupancy and Tensor Core usage, which reveal the real picture.

What's the single most common cause of low GPU utilization in training? The input data pipeline. If CPU-side data loading and preprocessing cannot keep up, the GPU stalls waiting for the next batch. Increasing data-loader workers, enabling prefetching and pinned memory, and pre-processing or caching data usually delivers the largest, easiest gains.
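
The idea behind prefetching can be sketched with the standard library alone: a background thread loads the next batches while the consumer works on the current one. `prefetch` and `load_batches` are hypothetical names, and the sleep stands in for disk reads and preprocessing. In PyTorch the corresponding knobs are the `DataLoader` arguments `num_workers`, `pin_memory`, and `prefetch_factor`.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, depth=2):
    """Yield items from `batches`, loading up to `depth` ahead in a background thread."""
    buf = queue.Queue(maxsize=depth)

    def producer():
        try:
            for b in batches:
                buf.put(b)  # blocks when the buffer is full
        finally:
            buf.put(_DONE)  # always signal the end, even if loading fails

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def load_batches(n):
    for i in range(n):
        time.sleep(0.01)  # stand-in for slow CPU-side loading
        yield i


seen = [batch for batch in prefetch(load_batches(5))]
print(seen)  # [0, 1, 2, 3, 4]
```

Order is preserved because a single producer feeds a FIFO queue; the sketch ends the stream quietly if loading raises, which a production loader would want to surface instead.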

How does mixed precision improve utilization? Training in FP16 or BF16 uses the GPU's Tensor Cores, which are far faster than standard FP32 units for matrix math, and it roughly halves the memory needed for activations so you can run larger batches. Together this often roughly doubles throughput with negligible accuracy impact, which is why automatic mixed precision (AMP) is standard practice.
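
The memory side of that claim is simple arithmetic, sketched below for a single activation tensor. The function names, tensor shape, and 1 GiB budget are illustrative assumptions; a real job also holds weights, optimizer state, and often FP32 master copies of the weights, so total savings are smaller than the per-tensor figure.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def activation_bytes(batch, seq_len, hidden, dtype):
    """Bytes for one (batch, seq_len, hidden) activation tensor."""
    return batch * seq_len * hidden * BYTES_PER_ELEMENT[dtype]


def max_batch(budget_bytes, seq_len, hidden, dtype):
    """Largest batch whose single activation tensor fits in the budget."""
    return budget_bytes // (seq_len * hidden * BYTES_PER_ELEMENT[dtype])


BUDGET = 2**30  # 1 GiB, hypothetical budget for this one tensor

print(max_batch(BUDGET, seq_len=2048, hidden=4096, dtype="fp32"))  # 32
print(max_batch(BUDGET, seq_len=2048, hidden=4096, dtype="bf16"))  # 64
```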

What improves GPU utilization for LLM inference specifically? Continuous batching combined with paged attention, as implemented in vLLM, SGLang, and TensorRT-LLM. Instead of serving one request at a time, the server batches many in-flight sequences and manages the KV cache efficiently, dramatically raising tokens-per-second per GPU. Quantization and right-sized autoscaling add further gains.
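
A toy scheduler shows why refilling slots matters. The simulation below counts decode steps for a fixed number of batch slots under two policies; all names are hypothetical, and it deliberately ignores prefill cost, KV-cache memory limits, and paged attention, so it illustrates the scheduling idea rather than any particular server's behavior.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch of `slots` requests runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """Free slots are refilled from the waiting queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits a token; finished ones leave.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def slot_utilization(lengths, slots, steps):
    """Share of slot-steps that produced a token."""
    return sum(lengths) / (slots * steps)


lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # output tokens per request
static = static_batching_steps(lengths, slots=4)
continuous = continuous_batching_steps(lengths, slots=4)

print(static, continuous)                    # 16 10
print(slot_utilization(lengths, 4, static))      # 0.4375
print(slot_utilization(lengths, 4, continuous))  # 0.7
```

In this example the short requests no longer wait for the long ones in their batch, so the same work finishes in 10 steps instead of 16.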

What tools should I use to monitor GPU utilization across a cluster? NVIDIA DCGM with its Prometheus exporter, visualized in Grafana, is the standard for fleet-wide GPU monitoring — SM activity, Tensor Core usage, memory bandwidth, power, and temperature. On Kubernetes, the NVIDIA GPU Operator can deploy DCGM and exporters automatically, and you can alert on idle or saturated GPUs.
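
As a rough sketch of what such alerting looks at, the code below parses a Prometheus text payload and flags GPUs that are idle, plus GPUs that look busy while their Tensor Cores do little. `DCGM_FI_DEV_GPU_UTIL` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` are DCGM field names; the sample payload, label values, thresholds, and function names are made up for illustration, and the sketch keys on the `gpu` label alone, so it assumes a single node.

```python
import re

UTIL = "DCGM_FI_DEV_GPU_UTIL"               # percent, 0-100
TENSOR = "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE"  # ratio, 0-1

LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Map (metric_name, gpu_label) -> value from Prometheus text format."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skips comments, blank lines, and label-free samples
        name, labels, value = m.groups()
        gpu = dict(LABEL.findall(labels)).get("gpu", "?")
        out[(name, gpu)] = float(value)
    return out


def idle_gpus(metrics, threshold=10.0):
    """GPUs whose time-based utilization is below the threshold."""
    return sorted(g for (n, g), v in metrics.items() if n == UTIL and v < threshold)


def busy_but_low_tensor(metrics, util_min=90.0, tensor_max=0.10):
    """GPUs that look busy while Tensor Core activity stays low."""
    return sorted(
        g for (n, g), v in metrics.items()
        if n == UTIL and v >= util_min
        and (TENSOR, g) in metrics and metrics[(TENSOR, g)] <= tensor_max
    )


SAMPLE = """
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 98
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="node-a"} 3
DCGM_FI_DEV_GPU_UTIL{gpu="2",Hostname="node-a"} 97
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0",Hostname="node-a"} 0.04
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="1",Hostname="node-a"} 0.0
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="2",Hostname="node-a"} 0.62
"""

metrics = parse_metrics(SAMPLE)
print(idle_gpus(metrics))            # ['1']
print(busy_but_low_tensor(metrics))  # ['0']
```

In a real deployment the same conditions would more naturally live in Prometheus alert rules; the Python form is only meant to show the logic.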

Can I share one GPU across multiple workloads? Yes. Multi-Instance GPU (MIG) on supported NVIDIA hardware partitions one physical GPU into isolated instances, and time-slicing lets multiple pods share a GPU sequentially. This raises utilization for many small workloads — like serving several lightweight models — so you buy fewer GPUs overall, at the cost of some isolation and scheduling complexity.
