What is GPU memory fragmentation and how do you avoid it?

Curated by Kory White · Fractional CRO, CRO Syndicate
📅 Published · 8 min read

Direct Answer

GPU memory fragmentation is wasted VRAM that you cannot use even though the total free amount looks sufficient. It happens because GPU memory is allocated and freed in differently sized blocks, leaving gaps between live allocations. There are two flavors. External fragmentation is when free memory is split into many small non-contiguous chunks, so a large allocation fails despite enough total free bytes (the classic "CUDA out of memory: tried to allocate X, but Y free" error). Internal fragmentation is when memory you reserved is only partly used (a request rounded up to a larger block, or a pre-allocated buffer sized for the worst case).

In LLM serving this is acute because the KV cache grows unpredictably with each request's sequence length. You avoid it with the right allocator and serving design: use a caching (pooled) allocator such as PyTorch's caching allocator with expandable_segments enabled, use PagedAttention (vLLM) so the KV cache is allocated in fixed-size pages instead of one big contiguous block, set sensible memory pre-allocation fractions, avoid wildly variable allocation sizes, and empty the cache only when truly necessary.

The goal is to keep allocations uniform and pooled so free memory stays usable rather than shattered into unusable gaps.

What fragmentation actually is

When a program allocates GPU memory, the allocator carves a region out of VRAM. When it frees that region, the space becomes available again — but only as a hole of a specific size in a specific location. Allocate and free many differently-sized tensors over time and the free memory becomes a patchwork of holes.

flowchart TD
    V[Total VRAM] --> A[Allocate A]
    A --> B[Allocate B]
    B --> FA[Free A leaves a hole]
    FA --> C[Allocate large C]
    C --> X{Fits in any single hole?}
    X -->|No, only scattered gaps| OOM[OOM despite free bytes]
    X -->|Yes| OK[Success]
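
To make the flow above concrete, here is a minimal sketch of external fragmentation using a toy first-fit allocator over a 100-unit address space. The ToyAllocator class and its sizes are hypothetical and exist only for illustration; real GPU allocators are considerably more sophisticated.

```python
from typing import Dict, List, Optional, Tuple


class ToyAllocator:
    """First-fit allocator over a linear address space (a stand-in for VRAM)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.live: Dict[str, Tuple[int, int]] = {}  # name -> (offset, size)

    def _holes(self) -> List[Tuple[int, int]]:
        """Return the free (offset, size) gaps between live allocations."""
        holes, cursor = [], 0
        for offset, size in sorted(self.live.values()):
            if offset > cursor:
                holes.append((cursor, offset - cursor))
            cursor = offset + size
        if cursor < self.capacity:
            holes.append((cursor, self.capacity - cursor))
        return holes

    def free_bytes(self) -> int:
        return sum(size for _, size in self._holes())

    def largest_hole(self) -> int:
        return max((size for _, size in self._holes()), default=0)

    def alloc(self, name: str, size: int) -> Optional[int]:
        """Place the block in the first hole that fits; None means OOM."""
        for offset, hole in self._holes():
            if hole >= size:
                self.live[name] = (offset, size)
                return offset
        return None

    def free(self, name: str) -> None:
        del self.live[name]


vram = ToyAllocator(capacity=100)
for name in ("A", "B", "C", "D"):
    vram.alloc(name, 25)        # fill the space with four equal blocks
vram.free("A")                  # leaves a hole at offset 0
vram.free("C")                  # leaves a hole at offset 50

print(vram.free_bytes())        # 50 units free in total
print(vram.largest_hole())      # the largest single hole is only 25
print(vram.alloc("E", 40))      # None: 40 does not fit in any one hole
```

Half of the space is free, yet a request for 40 units fails because no single hole is large enough. That is the OOM branch of the diagram.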

Why LLM serving makes it worse

For training with fixed batch and sequence sizes, allocations are fairly uniform and fragmentation is manageable. LLM inference is different because the KV cache — the per-request memory holding attention keys and values — grows as the model generates tokens, and every request has a different, unpredictable length.

A naive server reserves a contiguous KV-cache block sized for the maximum sequence length for every request. That causes massive internal fragmentation (most requests are far shorter than the max) and external fragmentation (requests start and finish at different times, leaving variable holes).
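
A quick back-of-the-envelope sketch shows how large that internal waste can get. The maximum length and the request lengths below are made-up numbers, not measurements from any real server.

```python
MAX_SEQ_LEN = 2048  # hypothetical per-request reservation, in tokens


def reserved_vs_used(actual_lengths, max_seq_len=MAX_SEQ_LEN):
    """Return (reserved_tokens, used_tokens, wasted_fraction) for naive serving."""
    if not actual_lengths:
        raise ValueError("need at least one request length")
    reserved = max_seq_len * len(actual_lengths)
    used = sum(min(n, max_seq_len) for n in actual_lengths)
    return reserved, used, 1 - used / reserved


lengths = [128, 256, 512, 1024]  # made-up request lengths, in tokens
reserved, used, wasted = reserved_vs_used(lengths)
print(reserved, used, f"{wasted:.1%}")  # 8192 1920 76.6%
```

With these illustrative numbers, more than three quarters of the reserved KV-cache slots never hold a token.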

The vLLM authors reported that existing serving systems wasted roughly 60 to 80 percent of KV-cache memory to fragmentation and over-reservation, which directly limited how many concurrent requests a GPU could serve.

flowchart LR
    R1[Req 1 short] --> KV[KV cache pool]
    R2[Req 2 long] --> KV
    R3[Req 3 medium] --> KV
    KV --> PAG[PagedAttention: fixed pages, no contiguous reservation]
    PAG --> EFF[Near-zero KV waste, more concurrency]

Fix 1: Use a caching/pooled allocator

Frameworks do not call the driver for every tensor. They use a caching allocator that reserves big slabs of VRAM and sub-allocates from a pool, reusing freed blocks instead of returning them to the driver. This already reduces fragmentation, and you can tune it further through PYTORCH_CUDA_ALLOC_CONF, chiefly with expandable_segments and max_split_size_mb.
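
As a rough sketch of that tuning in PyTorch, the allocator options can be assembled into the PYTORCH_CUDA_ALLOC_CONF environment variable, which takes comma-separated key:value pairs. The build_alloc_conf helper is hypothetical, and the 512 MB split size is an illustrative value rather than a recommendation.

```python
import os


def build_alloc_conf(expandable_segments=True, max_split_size_mb=None):
    """Build a PYTORCH_CUDA_ALLOC_CONF string of comma-separated key:value options."""
    options = [f"expandable_segments:{expandable_segments}"]
    if max_split_size_mb is not None:
        options.append(f"max_split_size_mb:{int(max_split_size_mb)}")
    return ",".join(options)


# Set this before the process makes its first CUDA allocation
# (safest: before importing torch at all).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf(max_split_size_mb=512)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same string can be exported in the shell that launches the job. Whether a particular max_split_size_mb helps depends on the workload, so treat it as something to measure rather than a default.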

Fix 2: Use PagedAttention for KV cache

The biggest win for LLM serving is PagedAttention, introduced by vLLM and now adopted broadly. Inspired by virtual memory paging in operating systems, it stores the KV cache in fixed-size pages that need not be contiguous, instead of one large contiguous reservation per request. Pages are handed out on demand as a sequence grows and return to a shared pool when the request finishes.

If you serve LLMs, using an engine with paged KV cache (vLLM, TensorRT-LLM, SGLang, TGI) is the most important anti-fragmentation decision you can make.
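
The bookkeeping idea behind a paged KV cache can be sketched in a few lines. This is a simplified illustration, not vLLM's implementation: the PagedKVCache class, its method names, and the page size of 16 tokens are all assumptions made for the example, and it tracks only page ownership, not the tensors themselves.

```python
from typing import Dict, List


class PagedKVCache:
    """Toy block table: each request owns a list of fixed-size page ids."""

    def __init__(self, num_pages: int, page_size: int) -> None:
        self.page_size = page_size                     # tokens per page
        self.free_pages: List[int] = list(range(num_pages))
        self.block_table: Dict[str, List[int]] = {}    # request -> page ids
        self.lengths: Dict[str, int] = {}              # request -> tokens stored

    def append_token(self, request_id: str) -> bool:
        """Account for one more token; take a new page only when the last is full."""
        pages = self.block_table.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(pages) * self.page_size:      # last page full, or none yet
            if not self.free_pages:
                return False                           # pool exhausted
            pages.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return True

    def release(self, request_id: str) -> None:
        """Return every page of a finished request to the shared pool."""
        self.free_pages.extend(self.block_table.pop(request_id, []))
        self.lengths.pop(request_id, None)


cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(40):
    cache.append_token("req-1")    # 40 tokens occupy 3 pages
for _ in range(20):
    cache.append_token("req-2")    # 20 tokens occupy 2 pages
print(len(cache.block_table["req-1"]), len(cache.block_table["req-2"]))  # 3 2
cache.release("req-1")
print(len(cache.free_pages))       # 6: freed pages are reusable by any request
```

Because every page is the same size, any freed page fits any future request, and the only unused slots are in the last, partly filled page of each sequence.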

Fix 3: Keep allocations uniform and pre-plan memory

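Fragmentation thrives on variability, so reduce it: pre-allocate a planned memory fraction rather than letting allocations grow ad hoc, and bucket and pad inputs so allocation sizes stay uniform.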
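
One simple way to cut variability is to round input lengths up to a small set of buckets, so the allocator sees a handful of sizes instead of hundreds. The sketch below assumes a hypothetical bucket list; suitable boundaries depend on your traffic.

```python
import bisect

BUCKETS = [128, 256, 512, 1024, 2048]  # hypothetical padded lengths, sorted ascending


def bucket_length(seq_len, buckets=BUCKETS):
    """Round a sequence length up to the smallest bucket that holds it."""
    if seq_len > buckets[-1]:
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[bisect.bisect_left(buckets, seq_len)]


raw_lengths = [37, 130, 256, 900, 1500]
print([bucket_length(n) for n in raw_lengths])  # [128, 256, 256, 1024, 2048]
```

Padding trades a little internal fragmentation for far fewer distinct allocation sizes, which makes freed blocks much more likely to be reused.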

Fix 4: Diagnose before you guess

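Don't fight phantom fragmentation. Measure it: compare torch.cuda.memory_reserved() (what the allocator holds) with torch.cuda.memory_allocated() (what is actually in use).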

If memory_reserved is much larger than memory_allocated and you still OOM, fragmentation — not raw capacity — is your problem, and expandable_segments plus paged KV cache are the fixes.
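
A minimal sketch of that reserved-versus-allocated check follows. The fragmentation_report and report_current_device helpers are hypothetical names; the only PyTorch calls used are torch.cuda.memory_reserved() and torch.cuda.memory_allocated(), and the sample figures are made up.

```python
def fragmentation_report(reserved_bytes, allocated_bytes):
    """Summarize how much reserved memory is not backing live tensors."""
    idle = reserved_bytes - allocated_bytes
    ratio = idle / reserved_bytes if reserved_bytes else 0.0
    return {
        "reserved": reserved_bytes,
        "allocated": allocated_bytes,
        "idle": idle,
        "idle_ratio": ratio,
    }


def report_current_device():
    """Read both counters from PyTorch; needs torch and a CUDA device."""
    import torch  # imported lazily so the helper above works without PyTorch

    return fragmentation_report(
        torch.cuda.memory_reserved(),
        torch.cuda.memory_allocated(),
    )


# Made-up numbers: 20 GiB reserved by the allocator, 12 GiB in live tensors.
GIB = 1024 ** 3
print(fragmentation_report(20 * GIB, 12 * GIB)["idle_ratio"])  # 0.4
```

A high idle ratio on its own only shows that the allocator is caching memory. It points to fragmentation when it coincides with OOM errors, as described above.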

Putting it together

To avoid GPU memory fragmentation: serve LLMs with a paged-KV-cache engine (vLLM/TensorRT-LLM/SGLang/TGI); enable PyTorch's expandable_segments and tune max_split_size_mb for variable workloads; pre-allocate a planned memory fraction rather than letting allocations grow ad hoc; bucket and pad inputs to keep allocation sizes uniform; lower precision with quantization to widen your headroom; and measure reserved-vs-allocated and KV utilization so you fix the real bottleneck.

Done together, these keep free VRAM contiguous and reusable, letting you pack more work onto each GPU without mysterious out-of-memory failures.

Frequently Asked Questions

Why do I get "CUDA out of memory" when nvidia-smi shows free memory?

Because the free memory is fragmented: it is split into non-contiguous holes too small for the allocation you need, even though the total is sufficient. The framework's caching allocator may also be holding reserved memory it has not handed back. Enabling PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, reducing allocation-size variability, and using a paged KV cache usually resolves it.

What is the difference between internal and external fragmentation?

Internal fragmentation is wasted space *inside* an allocation: you reserved more than you use, often because the request was rounded up to a block size or pre-sized for the worst case. External fragmentation is wasted space *between* allocations: free memory exists but is scattered in non-contiguous chunks, so a large contiguous request fails. LLM KV caches can suffer from both.

How does PagedAttention reduce fragmentation?

PagedAttention stores the KV cache in fixed-size, non-contiguous pages allocated on demand, like operating-system virtual memory. This removes the need to reserve a large contiguous block sized for the maximum sequence length per request, which confines internal fragmentation to the last, partly filled page of each sequence. Because all pages are the same size, a freed page is instantly reusable by any request, which removes external fragmentation within the KV-cache pool. The result is far higher concurrency on the same GPU.

Does calling torch.cuda.empty_cache() fix fragmentation?

Only partially, and it can hurt performance if overused. Calling empty_cache() returns cached-but-unused blocks to the driver, which can free up contiguous space, but the framework then has to re-acquire memory, adding latency. Use it sparingly between distinct workload phases. For ongoing variable workloads, expandable_segments and paged KV caching are better, structural fixes.

How do I tell if fragmentation, not capacity, is my problem?

Compare torch.cuda.memory_reserved() (what the allocator holds) with torch.cuda.memory_allocated() (what is actually in use). A large gap, combined with OOM errors, indicates fragmentation rather than a genuine shortage of memory. PyTorch's memory snapshot/visualizer shows the allocation timeline and gaps, and engine KV-cache utilization metrics reveal whether the serving layer is the cause.

Does quantization help with fragmentation?

Indirectly, yes. Quantizing weights and the KV cache (for example to FP8 or INT8) shrinks each allocation, so you have more free headroom and are less likely to hit the edge where fragmentation causes an OOM. It does not fix the structural cause of fragmentation, but it buys margin and lets you serve more concurrent requests, which is often the practical goal.
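
To put rough numbers on that headroom, the per-token KV-cache footprint is 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The model shape below is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of keys plus values stored for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)
fp16 = kv_bytes_per_token(**shape, bytes_per_element=2)
fp8 = kv_bytes_per_token(**shape, bytes_per_element=1)
print(fp16 // 1024, fp8 // 1024)  # 512 256 (KiB per token)
```

Halving the element size halves the KV cache per token, so the same pool holds twice as many tokens before any allocation comes close to failing.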
