What is GPU memory fragmentation and how do you avoid it?
Direct Answer
GPU memory fragmentation is VRAM you cannot use even though the total free amount looks sufficient. It happens because GPU memory is allocated and freed in differently sized blocks, leaving gaps between live allocations. There are two flavors: external fragmentation, where free memory is split into many small non-contiguous chunks so a large allocation fails despite enough total free bytes (the classic "CUDA out of memory: tried to allocate X, but Y free" error), and internal fragmentation, where memory you reserved is only partly used (a request rounded up to a larger block, or a pre-allocated buffer sized for the worst case).
In LLM serving this is acute because the KV cache grows unpredictably with each request's sequence length. You avoid it with the right allocator and serving design: use a caching/pooled allocator (PyTorch's caching allocator with expandable_segments), use PagedAttention (vLLM) so the KV cache is allocated in fixed-size pages instead of one big contiguous block, set sensible memory pre-allocation fractions, avoid wildly variable allocation sizes, and empty the cache only when truly necessary.
The goal is to keep allocations uniform and pooled so free memory stays usable rather than shattered into unusable gaps.
What fragmentation actually is
When a program allocates GPU memory, the allocator carves a region out of VRAM. When it frees that region, the space becomes available again — but only as a hole of a specific size in a specific location. Allocate and free many differently-sized tensors over time and the free memory becomes a patchwork of holes.
- External fragmentation: the free holes are scattered and non-contiguous. You have, say, 4 GB free total, but no single contiguous 2 GB block, so a 2 GB allocation fails. This is why you can see "out of memory" with gigabytes "free."
- Internal fragmentation: memory is reserved but underused. Allocators often round requests up to a block-size boundary or to a power of two, so a 1.1 GB request might consume a 2 GB block, with 0.9 GB locked but unused. Pre-allocating a fixed-size buffer for the largest possible input wastes memory on smaller inputs the same way.
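Both kinds of waste can be reproduced with a toy model. The sketch below is not how a real GPU allocator works; it assumes a 16-unit "VRAM" array, a simple first-fit policy, and power-of-two rounding purely for illustration, and all function names are hypothetical.

```python
def largest_free_run(slots):
    """Length of the longest run of consecutive free (None) slots."""
    best = run = 0
    for s in slots:
        run = run + 1 if s is None else 0
        best = max(best, run)
    return best


def first_fit(slots, size, tag):
    """Place `size` contiguous units in the first hole that fits.

    Returns the start index, or None if no single hole is big enough.
    """
    run = 0
    for i, s in enumerate(slots):
        run = run + 1 if s is None else 0
        if run == size:
            start = i - size + 1
            slots[start:i + 1] = [tag] * size
            return start
    return None


def free(slots, tag):
    """Release every unit owned by `tag`."""
    for i, s in enumerate(slots):
        if s == tag:
            slots[i] = None


def round_up_pow2(n):
    """Round a request up to the next power of two (one possible rounding rule)."""
    return 1 << (n - 1).bit_length()


# External fragmentation: fill memory with eight 2-unit tensors,
# then free every other one.
vram = [None] * 16
for t in range(8):
    first_fit(vram, 2, t)
for t in (0, 2, 4, 6):
    free(vram, t)

total_free = vram.count(None)           # half of memory is free...
largest_hole = largest_free_run(vram)   # ...but only in 2-unit holes
big_alloc = first_fit(vram, 4, "big")   # so a 4-unit request fails

# Internal fragmentation: a 1100 MB request rounded up to a 2048 MB block.
requested_mb = 1100
block_mb = round_up_pow2(requested_mb)
wasted_mb = block_mb - requested_mb
```

Here half of the toy memory is free, yet the 4-unit request cannot be placed because no hole is larger than 2 units. The rounding example mirrors the 1.1 GB request above that ends up occupying a 2 GB block.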
Why LLM serving makes it worse
For training with fixed batch and sequence sizes, allocations are fairly uniform and fragmentation is manageable. LLM inference is different because the KV cache — the per-request memory holding attention keys and values — grows as the model generates tokens, and every request has a different, unpredictable length.
A naive server reserves a contiguous KV-cache block sized for the maximum sequence length for every request. That causes massive internal fragmentation (most requests are far shorter than the max) and external fragmentation (requests start and finish at different times, leaving variable holes).
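The PagedAttention paper behind vLLM (Kwon et al., SOSP 2023) reports that in the existing systems it profiled, only about 20% to 38% of KV-cache memory actually held token state; the rest was lost to reservation and fragmentation, directly limiting how many concurrent requests a GPU could serve.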
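To make the cost of max-length reservation concrete, here is a small arithmetic sketch. The request lengths and the 2048-token maximum are made-up numbers, and the helper name is hypothetical.

```python
def kv_slot_utilization(actual_lengths, max_len):
    """Fraction of reserved KV slots that hold real tokens when every
    request reserves `max_len` slots up front."""
    reserved = len(actual_lengths) * max_len
    used = sum(min(n, max_len) for n in actual_lengths)
    return used / reserved


# Hypothetical final sequence lengths for six concurrent requests.
lengths = [120, 300, 64, 900, 2048, 410]
max_len = 2048

utilization = kv_slot_utilization(lengths, max_len)
print(f"KV slots in use: {utilization:.1%}, reserved but idle: {1 - utilization:.1%}")
```

With these assumed lengths, fewer than a third of the reserved slots ever hold a token, even though one request does reach the maximum.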
Fix 1: Use a caching/pooled allocator
Frameworks do not call the driver for every tensor — they use a caching allocator that reserves big slabs of VRAM and sub-allocates from a pool, reusing freed blocks instead of returning them to the driver. This already reduces fragmentation, but you can tune it:
- PyTorch caching allocator with expandable_segments: setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True lets the allocator grow existing segments instead of carving out many fixed-size ones, which can sharply reduce external fragmentation for variable workloads. This is one of the most effective fixes for "OOM with free memory."
- Tune max_split_size_mb: limiting how the allocator splits large blocks can prevent it from chopping a big free region into unusable fragments.
- Avoid manual empty_cache() in hot paths: calling it constantly returns memory to the driver only to re-acquire it, hurting performance; use it sparingly, e.g., between distinct phases.
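The allocator options above are passed as comma-separated option:value pairs in the PYTORCH_CUDA_ALLOC_CONF environment variable. The helper below is a hypothetical convenience for building that string, not part of PyTorch, and the 128 MB split size shown in the comment is an arbitrary illustration rather than a recommendation.

```python
import os


def build_alloc_conf(expandable_segments=True, max_split_size_mb=None):
    """Build a PYTORCH_CUDA_ALLOC_CONF value from the two options discussed above."""
    parts = []
    if expandable_segments:
        parts.append("expandable_segments:True")
    if max_split_size_mb is not None:
        parts.append(f"max_split_size_mb:{int(max_split_size_mb)}")
    return ",".join(parts)


# The variable has to be in place before the process first touches CUDA;
# setting it before importing torch is the simplest way to guarantee that.
conf = build_alloc_conf(expandable_segments=True)
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)

# Alternative for workloads where capping block splitting helps (value is illustrative):
# build_alloc_conf(expandable_segments=False, max_split_size_mb=128)

# import torch  # import only after the environment variable is set
```

The same value can be exported in the shell that launches the job; which setting helps most depends on the workload, so it is worth comparing reserved versus allocated memory before and after.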

Fix 2: Use PagedAttention for KV cache
The biggest win for LLM serving is PagedAttention, introduced by vLLM and now adopted broadly. Inspired by virtual memory paging in operating systems, it stores the KV cache in fixed-size pages that need not be contiguous, instead of one large contiguous reservation per request:
- Pages are allocated on demand as a sequence grows, so short requests use few pages and long ones use more. Internal fragmentation shrinks from the unused tail of a max-length reservation to, at most, the unfilled slots of a request's last page.
- Because pages are uniform and non-contiguous, freed pages are immediately reusable by any request — eliminating external fragmentation.
- This lets a GPU pack far more concurrent sequences into the same VRAM, raising throughput substantially.
If you serve LLMs, using an engine with paged KV cache (vLLM, TensorRT-LLM, SGLang, TGI) is the most important anti-fragmentation decision you can make.
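The paging idea can be sketched in a few lines. This is a toy bookkeeping model under simplifying assumptions (one shared pool, no prefix sharing, no actual tensors); the class and method names are hypothetical and are not vLLM's API.

```python
class PagedKVCache:
    """Toy page table: each request maps to a list of fixed-size page ids."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request id -> list of page ids (any order)
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, req):
        """Reserve a slot for one more token, taking a new page only when needed."""
        n = self.lengths.get(req, 0)
        table = self.block_tables.setdefault(req, [])
        if n == len(table) * self.page_size:  # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        """Return all of a finished request's pages to the shared pool."""
        self.free_pages.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

    def wasted_slots(self):
        """Slots allocated but unused: only the tail of each request's last page."""
        return sum(
            len(table) * self.page_size - self.lengths.get(req, 0)
            for req, table in self.block_tables.items()
        )


cache = PagedKVCache(num_pages=8, page_size=4)
for _ in range(5):
    cache.append_token("a")   # 5 tokens -> 2 pages
for _ in range(9):
    cache.append_token("b")   # 9 tokens -> 3 pages
wasted_before = cache.wasted_slots()

cache.release("a")            # a's pages go straight back to the pool
for _ in range(17):
    cache.append_token("c")   # 17 tokens -> 5 pages, scattered across the pool
```

Request "c" fits even though its five pages are not adjacent: it reuses the two pages freed by "a" plus three that were never used. Waste is bounded by one partially filled page per live request.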
Fix 3: Keep allocations uniform and pre-plan memory
Fragmentation thrives on variability, so reduce it:
- Pad or bucket inputs. Grouping requests into a few sequence-length buckets makes allocations uniform and reusable, instead of a unique size per request.
- Pre-allocate a memory fraction. Many engines let you set the share of VRAM the engine may claim up front (e.g., vLLM's gpu_memory_utilization, which covers weights, activations, and the KV cache pool carved from what remains). Reserving a planned pool up front prevents ad-hoc allocations from fragmenting memory.
- Use static shapes where possible. Fixed batch/sequence shapes (and CUDA graphs) avoid re-allocations entirely for the steady state.
- Watch dtype and quantization. Lower-precision weights and KV cache (FP8/INT8) shrink allocations, leaving more headroom so fragmentation is less likely to push you over the edge.
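Bucketing, the first item in the list above, can be sketched as follows. The bucket boundaries, the pad id of 0, and the function names are assumptions chosen for illustration; real values depend on the model and tokenizer.

```python
import bisect

# Hypothetical sequence-length buckets; pick boundaries that match your traffic.
BUCKETS = [128, 256, 512, 1024, 2048]


def bucket_for(length, buckets=BUCKETS):
    """Smallest bucket that can hold `length` tokens."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(token_ids, pad_id=0, buckets=BUCKETS):
    """Right-pad a token list so its length equals its bucket size."""
    target = bucket_for(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))


# Four different request lengths collapse to three allocation sizes.
request_lengths = [5, 90, 130, 700]
sizes = sorted({bucket_for(n) for n in request_lengths})
padded = pad_to_bucket([101, 2023, 102])
```

The trade-off is deliberate: padding spends some memory inside each buffer (internal fragmentation) in exchange for a small set of buffer sizes that the allocator can reuse exactly.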
Fix 4: Diagnose before you guess
Don't fight phantom fragmentation — measure it:
- nvidia-smi shows total used vs. free memory per GPU and the memory held by each process, but it cannot see inside the framework's cache.
- PyTorch memory tools: torch.cuda.memory_summary(), memory_allocated() vs. memory_reserved() (a big gap between reserved and allocated signals fragmentation), and the memory snapshot/visualizer that records allocation history and shows exactly where the gaps are.
- Engine metrics: vLLM and others expose KV-cache utilization and the number of running/waiting sequences, telling you whether you are memory-bound.
If memory_reserved is much larger than memory_allocated and you still OOM, fragmentation — not raw capacity — is your problem, and expandable_segments plus paged KV cache are the fixes.
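A minimal sketch of the reserved-versus-allocated check follows. The helper names are hypothetical; torch.cuda.memory_reserved() and torch.cuda.memory_allocated() are the PyTorch calls named above, and the torch import is deferred so the arithmetic can be read and run without a GPU.

```python
def reserved_gap(reserved_bytes, allocated_bytes):
    """Return (gap in bytes, gap as a fraction of reserved memory)."""
    gap = reserved_bytes - allocated_bytes
    fraction = gap / reserved_bytes if reserved_bytes else 0.0
    return gap, fraction


def report_cuda_gap(device=0):
    """Measure the gap on a live CUDA device, or return None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    reserved = torch.cuda.memory_reserved(device)    # held by the caching allocator
    allocated = torch.cuda.memory_allocated(device)  # occupied by live tensors
    return reserved_gap(reserved, allocated)


# Example with made-up numbers: 20 GiB reserved, 12 GiB in live tensors.
GiB = 2**30
gap_bytes, gap_fraction = reserved_gap(20 * GiB, 12 * GiB)
print(f"gap: {gap_bytes / GiB:.1f} GiB ({gap_fraction:.0%} of reserved)")
```

A large gap on its own only shows that the allocator is holding cached blocks; it points to fragmentation when it coincides with OOM errors, which is when the memory snapshot is worth capturing.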
Putting it together
To avoid GPU memory fragmentation: serve LLMs with a paged-KV-cache engine (vLLM/TensorRT-LLM/SGLang/TGI); enable PyTorch's expandable_segments and tune max_split_size_mb for variable workloads; pre-allocate a planned memory fraction rather than letting allocations grow ad hoc; bucket and pad inputs to keep allocation sizes uniform; lower precision with quantization to widen your headroom; and measure reserved-vs-allocated and KV utilization so you fix the real bottleneck.
Done together, these keep free VRAM pooled and reusable, letting you pack more work onto each GPU without mysterious out-of-memory failures.
Frequently Asked Questions
Why do I get "CUDA out of memory" when nvidia-smi shows free memory? Because the free memory is fragmented — split into non-contiguous holes too small for the allocation you need, even though the total is sufficient. The framework's caching allocator may also be holding reserved memory it has not handed back.
Enabling PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, reducing allocation-size variability, and using a paged KV cache usually resolves it.
What is the difference between internal and external fragmentation? Internal fragmentation is wasted space *inside* an allocation — you reserved more than you use, often because the request was rounded up to a block size or pre-sized for the worst case. External fragmentation is wasted space *between* allocations — free memory exists but is scattered in non-contiguous chunks, so a large contiguous request fails.
LLM KV caches can suffer from both.
How does PagedAttention reduce fragmentation? PagedAttention stores the KV cache in fixed-size, non-contiguous pages allocated on demand, like operating-system virtual memory. This removes the need to reserve a large contiguous block sized for the maximum sequence length per request, cutting internal fragmentation to at most one partially filled page per request, and makes freed pages instantly reusable by any request, eliminating external fragmentation.
The result is far higher concurrency on the same GPU.
Does calling torch.cuda.empty_cache() fix fragmentation? Only partially, and it can hurt performance if overused. empty_cache() returns cached-but-unused blocks to the driver, which can free up contiguous space, but the framework then has to re-acquire memory, adding latency.
Use it sparingly between distinct workload phases. For ongoing variable workloads, expandable_segments and paged KV caching are better, structural fixes.
How do I tell if fragmentation, not capacity, is my problem? Compare torch.cuda.memory_reserved() (what the allocator holds) with torch.cuda.memory_allocated() (what is actually in use). A large gap, combined with OOM errors, indicates fragmentation rather than a genuine shortage of memory.
PyTorch's memory snapshot/visualizer shows the allocation timeline and gaps, and engine KV-cache utilization metrics reveal whether the serving layer is the cause.
Does quantization help with fragmentation? Indirectly, yes. Quantizing weights and the KV cache (for example to FP8 or INT8) shrinks each allocation, so you have more free headroom and are less likely to hit the edge where fragmentation causes an OOM. It does not fix the structural cause of fragmentation, but it buys margin and lets you serve more concurrent requests, which is often the practical goal.
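The headroom effect can be estimated with the usual per-token KV-cache size: two tensors (keys and values) per layer, each num_kv_heads × head_dim elements. The model shape and the 8 GiB budget below are hypothetical, and the sketch ignores any scaling metadata a quantized cache may also store.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV-cache bytes needed for one token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
fp16_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)
fp8_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)

# Tokens that fit in an assumed 8 GiB KV-cache budget.
budget = 8 * 2**30
fp16_tokens = budget // fp16_per_token
fp8_tokens = budget // fp8_per_token
```

Halving the bytes per element doubles the number of tokens the same pool can hold, which is the margin described above; the allocation pattern itself is unchanged.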
Sources
- NVIDIA — "CUDA memory management and best practices" (docs.nvidia.com/cuda)
- PyTorch — "CUDA semantics: memory management and caching allocator" (pytorch.org/docs)
- PyTorch — "Understanding GPU memory and the memory snapshot/visualizer" (pytorch.org/blog)
- vLLM — "PagedAttention and KV cache management" (docs.vllm.ai)
- Kwon et al. — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Hugging Face — "Text Generation Inference memory and paged attention" (huggingface.co/docs/text-generation-inference)
- NVIDIA — "TensorRT-LLM KV cache and paged attention" (github.com/NVIDIA/TensorRT-LLM)
