Pulse RevOps · The Machine · Accepted Answer

![How do you measure and improve GPU utilization?](https://techinspection.net/wp-content/uploads/2023/01/How-To-Measure-Gpu-Usage-2.png)

# How do you measure and improve GPU utilization?

### Direct Answer
You measure GPU utilization with profiling and monitoring tools (`nvidia-smi` and DCGM for live device metrics, PyTorch Profiler or Nsight for in-job analysis), but the key is to look past the headline "GPU utilization %" to the metrics that reflect useful work: **streaming-multiprocessor (SM) activity and occupancy, memory bandwidth, Tensor Core usage, and memory consumption**. A GPU can report 100% "utilization" while doing almost no math, because that figure only records that some kernel was running, not how much of the chip it used or whether it was stalled on memory. You improve utilization by removing whatever is starving the GPU: feed it faster (better data loading and batching), keep it busy (larger batches, continuous batching for inference, gradient accumulation), use it efficiently (mixed precision, Tensor Cores, the right kernels), and share idle capacity (MIG, time-slicing, multi-tenant scheduling). The goal is high *useful* throughput per dollar, not just a high number on a dashboard.

## What "GPU utilization" really means

The `nvidia-smi` "GPU-Util" figure is the percentage of time over a sample window during which at least one kernel was running. It says nothing about *how much* of the GPU was used. A trivial kernel that reads one value can keep that number at 100% while the thousands of cores and Tensor Cores sit idle. That is why teams routinely see a GPU "fully utilized" yet delivering a fraction of its theoretical throughput. To measure real efficiency you need deeper signals:

- **SM activity and occupancy:** how much of the time the SMs have work resident (activity) and how many of their warp slots that work fills (occupancy).
- **Tensor Core utilization:** whether the high-throughput matrix units (which do most ML math) are being used.
- **Memory bandwidth utilization:** many ML kernels are memory-bound, so this often gates throughput.
- **GPU memory usage:** how much VRAM the workload consumes, which constrains batch size and model size.
- **Throughput:** the business metric — samples/sec in training, tokens/sec in inference.
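To make the gap between "a kernel was running" and "the GPU was busy" concrete, here is a minimal sketch in plain Python. It is a toy model, not how the driver computes anything: the function names, the interval representation, and the assumption that kernels in the second function do not overlap are all illustrative.

```python
def busy_time_percent(kernel_intervals, window_s):
    """Percent of the window during which at least one kernel was running.

    kernel_intervals: list of (start_s, end_s) tuples (hypothetical trace data).
    """
    busy = 0.0
    current_start, current_end = None, None
    # Merge overlapping intervals so concurrent kernels are not double-counted.
    for start, end in sorted(kernel_intervals):
        start, end = max(start, 0.0), min(end, window_s)
        if end <= start:
            continue  # interval lies outside the sample window
        if current_end is None or start > current_end:
            if current_end is not None:
                busy += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        busy += current_end - current_start
    return 100.0 * busy / window_s


def average_sm_activity_percent(kernels, window_s, total_sms):
    """Time-weighted share of SMs doing work.

    kernels: list of (start_s, end_s, sms_used) tuples, assumed non-overlapping
    and already inside the window (a simplification for illustration).
    """
    sm_seconds = sum((end - start) * min(sms, total_sms)
                     for start, end, sms in kernels)
    return 100.0 * sm_seconds / (window_s * total_sms)
```

With a single kernel that runs for the whole one-second window but occupies one SM out of 108, `busy_time_percent([(0.0, 1.0)], 1.0)` is 100.0 while `average_sm_activity_percent([(0.0, 1.0, 1)], 1.0, 108)` is under 1.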

```mermaid
flowchart LR
    SMI[nvidia-smi GPU-Util %] -->|misleading| REAL[Real efficiency signals]
    REAL --> SM[SM occupancy]
    REAL --> TC[Tensor Core usage]
    REAL --> BW[Memory bandwidth]
    REAL --> TPUT[Throughput: samples or tokens/sec]
```
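One common way to reason about why memory bandwidth gates throughput is the roofline model: attainable FLOP/s is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below assumes that model; the function names and the hardware figures in the usage note are hypothetical round numbers, not the specification of any real GPU.

```python
def attainable_flops(peak_flops, mem_bandwidth_bytes_s, intensity_flops_per_byte):
    """Roofline bound: min(peak compute, intensity * memory bandwidth)."""
    return min(peak_flops, intensity_flops_per_byte * mem_bandwidth_bytes_s)


def is_memory_bound(peak_flops, mem_bandwidth_bytes_s, intensity_flops_per_byte):
    """True when memory bandwidth, not compute, caps the kernel."""
    return intensity_flops_per_byte * mem_bandwidth_bytes_s < peak_flops


def elementwise_add_intensity(bytes_per_element=4):
    """FLOPs per byte for c = a + b: 1 FLOP, two reads and one write per element."""
    return 1.0 / (3 * bytes_per_element)
```

On a hypothetical device with 100 TFLOP/s of peak compute and 1 TB/s of memory bandwidth, an fp32 element-wise add has an intensity of 1/12 FLOP per byte, so `attainable_flops(100e12, 1e12, elementwise_add_intensity())` is far below one percent of peak even though the kernel keeps the GPU "busy" the whole time.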

## How to measure it

Use the right tool for the right altitude:

- **`nvidia-smi`:** quick live check of utilization, memory, temperature, and power — good for spotting an idle or out-of-memory GPU, not for deep analysis.
- **NVIDIA DCGM (Data Center GPU Manager):** fleet-wide metrics — SM activity, Tensor Core usage, memory bandwidth, power — that you export to Prometheus and visualize in Grafana for continuous monitoring across a cluster.
- **PyTorch Profiler / TensorBoard:** in-training analysis showing where time goes (compute vs. data loading vs. communication) and flagging GPU stalls.
- **NVIDIA Nsight Systems / Nsight Compute:** the deepest profilers, for tracing kernels and the CPU-GPU timeline to find bottlenecks and underused Tensor Cores.
- **DCGM exporter + Prometheus + Grafana:** the standard stack for ongoing cluster observability; on Kubernetes, the NVIDIA GPU Operator wires much of this up automatically.
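For the quick-check end of that range, a small script can poll `nvidia-smi` in its CSV query mode and turn the rows into numbers. This is a minimal sketch: the helper names and the dictionary keys are made up for illustration, and rows with non-numeric values (for example `[N/A]`) are simply skipped.

```python
import subprocess

QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # not a data row
        try:
            index, util, used_mib, total_mib = (int(v) for v in values)
        except ValueError:
            continue  # a field such as "[N/A]" could not be read as a number
        rows.append({
            "index": index,
            "util_pct": util,  # time-based GPU-Util, not SM activity
            "mem_used_pct": 100.0 * used_mib / total_mib,
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return parsed rows (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

`parse_gpu_csv` can be exercised without a GPU by passing it a captured string, which also makes it easy to unit test; for continuous fleet monitoring the DCGM exporter is the better fit.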

Establish a baseline, then optimize against it and re-measure — utilization work without measurement is guessing.
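A baseline can be as simple as samples per second over a fixed number of steps. The sketch below is one way to take it, with hypothetical names throughout; it assumes `step_fn` runs one full step and, for CUDA work, blocks until the GPU has finished (for example by calling `torch.cuda.synchronize()` at the end), since otherwise the timer only measures kernel launches.

```python
import time


def measure_throughput(step_fn, batch_size, steps, warmup=3,
                       clock=time.perf_counter):
    """Return samples/sec for `steps` timed calls to `step_fn`.

    step_fn: zero-argument callable that runs one step and waits for it to finish.
    clock:   injectable timer, mainly so the function can be tested.
    """
    # Warm-up steps absorb one-time costs (allocation, compilation, caches).
    for _ in range(warmup):
        step_fn()
    start = clock()
    for _ in range(steps):
        step_fn()
    elapsed = clock() - start
    return batch_size * steps / elapsed


def relative_change_pct(baseline, candidate):
    """Percent change of a candidate throughput relative to the baseline."""
    return 100.0 * (candidate - baseline) / baseline
```

Record the baseline figure alongside the configuration that produced it, then rerun the same measurement after each change so that improvements are compared like for like.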

## Why GPUs sit idle

Most low utilization traces to the GPU waiting on something else:

- **Data loading (input pipeline) bottleneck:** the CPU cannot prepare and feed batches fast enough, so the GPU stalls between steps. This is the single most common cause in training.
- **Small batch sizes:** batches too small to fill the GPU leave compute units underused.
- **CPU-bound preprocessing or Python overhead** between GPU operations.
- **Communication overhead** in distributed training, where GPUs wait on gradient synchronization across nodes.
- **Memory limits** forcing tiny batches, or memory fragmentation causing out-of-memory and inefficient allocation.
- **For inference, low or bursty request volume** that leaves the GPU idle between requests, or naive serving that processes one request at a time.
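For the training case, a rough first test of the data-loading cause is to time how long each iteration waits for the next batch versus how long the step itself takes. The sketch below is a plain-Python illustration with hypothetical names; it assumes `step_fn` blocks until the GPU work is done, otherwise asynchronous kernel launches make the compute time look artificially small.

```python
import time


def profile_loop(batches, step_fn, clock=time.perf_counter):
    """Split loop time into waiting for data and running the step.

    batches: any iterable of batches (for example a data loader).
    step_fn: callable taking one batch; assumed to block until the step is done.
    """
    data_wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(iterator)  # time spent here is input-pipeline stall
        except StopIteration:
            break
        t1 = clock()
        step_fn(batch)
        t2 = clock()
        data_wait += t1 - t0
        compute += t2 - t1
    total = data_wait + compute
    return {
        "data_wait_s": data_wait,
        "compute_s": compute,
        "data_wait_fraction": data_wait / total if total else 0.0,
    }
```

A high `data_wait_fraction` points at the input pipeline; a low one suggests looking at batch size, communication, or kernel efficiency instead. A dedicated profiler gives a more precise breakdown than this coarse timer.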

```mermaid
flowchart TD
    IDLE[GPU underused] --> DATA{Waiting on data?}
    DATA -->|Yes| PIPE[Fix input pipeline / batching]
    DATA -->|No| COMM{Waiting on comms?}
    COMM -->|Yes| DIST[Optimize distributed comms]
    COMM -->|No| KERNEL[Inef
