What is GPU memory fragmentation and how do you avoid it?

Curated by Kory White · Fractional CRO, CRO Syndicate


📅 Published Jun 27, 2026 · 8 min read


![GPU memory fragmentation in AI workloads](https://image.pollinations.ai/prompt/GPU%20memory%20fragmentation%20VRAM%20allocation%20blocks%20KV%20cache%20paged%20attention%20optimization%20diagram%20glowing%20purple?width=1280&height=720&nologo=true)

# What is GPU memory fragmentation and how do you avoid it?

### Direct Answer
GPU memory fragmentation is wasted VRAM that you cannot use even though the total free amount looks sufficient. It happens because GPU memory is allocated and freed in differently-sized blocks, leaving **gaps** between live allocations. There are two flavors: **external fragmentation**, where free memory is split into many small non-contiguous chunks so a large allocation fails despite enough total free bytes (the classic `CUDA out of memory: tried to allocate X, but Y free` error), and **internal fragmentation**, where memory you reserved is only partly used (a request rounded up to a larger block, or a pre-allocated buffer sized for the worst case).

In LLM serving this is acute because the **KV cache** grows unpredictably with each request's sequence length. You avoid it with the right allocator and serving design: use a **caching/pooled allocator** (PyTorch's caching allocator with `expandable_segments`), use **PagedAttention** (vLLM) so the KV cache is allocated in fixed pages instead of one big contiguous block, set sensible **memory pre-allocation fractions**, avoid wildly variable allocation sizes, and empty the cache manually only when truly necessary.

The goal is to keep allocations uniform and pooled so free memory stays usable rather than shattered into unusable gaps.

## What fragmentation actually is

When a program allocates GPU memory, the allocator carves a region out of VRAM. When it frees that region, the space becomes available again — but only as a hole of a specific size in a specific location. Allocate and free many differently-sized tensors over time and the free memory becomes a patchwork of holes.

- **External fragmentation:** the free holes are scattered and non-contiguous. You have, say, 4 GB free total, but no single contiguous 2 GB block, so a 2 GB allocation fails. This is why you can see "out of memory" with gigabytes "free."
- **Internal fragmentation:** memory is reserved but underused. Allocators often round requests up to a block-size boundary or to a power of two, so a 1.1 GB request might consume a 2 GB block, with 0.9 GB locked but unused. Pre-allocating a fixed-size buffer for the largest possible input wastes memory on smaller inputs the same way.
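
The rounding arithmetic in the internal-fragmentation bullet can be checked in a few lines. This is a minimal sketch: `round_up_pow2` and `round_up_multiple` are hypothetical helpers written for illustration, and real allocators use their own size classes rather than these exact rules.

```python
GIB = 2**30

def round_up_pow2(n_bytes: int) -> int:
    """Smallest power of two that is >= n_bytes."""
    return 1 if n_bytes <= 1 else 1 << (n_bytes - 1).bit_length()

def round_up_multiple(n_bytes: int, block: int) -> int:
    """Smallest multiple of `block` that is >= n_bytes."""
    return -(-n_bytes // block) * block

# The 1.1 GB request from the text, under an assumed power-of-two policy.
request = int(1.1 * GIB)
reserved = round_up_pow2(request)
wasted = reserved - request

print(f"requested {request / GIB:.2f} GiB, reserved {reserved / GIB:.2f} GiB, "
      f"wasted {wasted / GIB:.2f} GiB ({wasted / reserved:.0%} of the block)")
```

Rounding to a fixed block multiple (`round_up_multiple`) wastes at most one block minus one byte per allocation, which is why fine-grained size classes cost far less than power-of-two rounding on large requests.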

```mermaid
flowchart TD
    V[Total VRAM] --> A[Allocate A]
    A --> B[Allocate B]
    B --> FA[Free A leaves a hole]
    FA --> C[Allocate large C]
    C --> X{Fits in any single hole?}
    X -->|No, only scattered gaps| OOM[OOM despite free bytes]
    X -->|Yes| OK[Success]
```
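
The same sequence can be reproduced with a toy allocator. The sketch below is a simplified first-fit model over an 8-unit address space, not how the CUDA driver or PyTorch actually manage memory; `ToyAllocator` and its methods are hypothetical names used only for illustration.

```python
class ToyAllocator:
    """First-fit allocator over a flat address space (illustrative only)."""

    def __init__(self, capacity: int):
        # Each block is [start, size, owner]; owner None means the block is free.
        self.blocks = [[0, capacity, None]]

    def alloc(self, size: int, owner: str) -> bool:
        for i, (start, bsize, bowner) in enumerate(self.blocks):
            if bowner is None and bsize >= size:
                self.blocks[i] = [start, size, owner]
                if bsize > size:  # keep the unused tail as a smaller free block
                    self.blocks.insert(i + 1, [start + size, bsize - size, None])
                return True
        return False  # no single hole is large enough

    def free(self, owner: str) -> None:
        for block in self.blocks:
            if block[2] == owner:
                block[2] = None
        # Coalesce adjacent free blocks into one hole.
        merged = []
        for block in self.blocks:
            if merged and merged[-1][2] is None and block[2] is None:
                merged[-1][1] += block[1]
            else:
                merged.append(block)
        self.blocks = merged

    def free_total(self) -> int:
        return sum(b[1] for b in self.blocks if b[2] is None)

    def largest_hole(self) -> int:
        return max((b[1] for b in self.blocks if b[2] is None), default=0)


heap = ToyAllocator(capacity=8)
for name in "ABCD":
    heap.alloc(2, name)          # fill memory with four 2-unit tensors
heap.free("A")
heap.free("C")                   # two separate 2-unit holes remain

free_before = heap.free_total()       # total free space
largest_before = heap.largest_hole()  # biggest contiguous hole
fits_before = heap.alloc(3, "E")      # fails: no hole of 3 units

heap.free("B")                        # holes A, B, C merge into one
fits_after = heap.alloc(3, "E")       # now succeeds
print(free_before, largest_before, fits_before, fits_after)
```

The failed request is smaller than the total free space, which is the external-fragmentation case in the diagram: what matters is the largest hole, not the sum of holes.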

## Why LLM serving makes it worse

For training with fixed batch and sequence sizes, allocations are fairly uniform and fragmentation is manageable. **LLM inference is different** because the **KV cache** — the per-request memory holding attention keys and values — grows as the model generates tokens, and every request has a different, unpredictable length.

A naive server reserves a **contiguous** KV-cache block sized for the maximum sequence length for every request. That causes massive **internal** fragmentation (most requests are far shorter than the max) and **external** fragmentation (requests start and finish at different times, leaving variable holes). The PagedAttention paper (Kwon et al., SOSP 2023) reports that in the systems it profiled, only about 20% to 38% of KV-cache memory held actual token state; the rest was lost to over-reservation and fragmentation, which directly limits how many concurrent requests a GPU can serve.
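
To put rough numbers on the max-length reservation, the sketch below uses the common estimate of KV-cache size per token (keys plus values, per layer, per KV head). The model shape (32 layers, 32 KV heads, head dimension 128, FP16) and the request lengths are assumptions chosen for illustration, not figures from any specific deployment.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: K and V for every layer and head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Hypothetical model shape and serving limits (assumptions).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32,
                               head_dim=128, dtype_bytes=2)
max_seq_len = 4096      # what a naive server reserves for every request
actual_len = 512        # what this particular request ends up using

reserved = per_token * max_seq_len
used = per_token * actual_len
waste_fraction = 1 - used / reserved

print(f"{per_token / 2**10:.0f} KiB per token, "
      f"{reserved / 2**30:.1f} GiB reserved, {used / 2**20:.0f} MiB used, "
      f"{waste_fraction:.1%} of the reservation idle")
```

Under these assumptions a single short request pins a multi-gigabyte block while using an eighth of it, which is the internal fragmentation the paragraph above describes.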

```mermaid
flowchart LR
    R1[Req 1 short] --> KV[KV cache pool]
    R2[Req 2 long] --> KV
    R3[Req 3 medium] --> KV
    KV --> PAG[PagedAttention: fixed pages, no contiguous reservation]
    PAG --> EFF[Near-zero KV waste, more concurrency]
```

## Fix 1: Use a caching/pooled allocator

Frameworks do not call the driver for every tensor — they use a **caching allocator** that reserves big slabs of VRAM and sub-allocates from a pool, reusing freed blocks instead of returning them to the driver. This already reduces fragmentation, but you can tune it:

- **PyTorch caching allocator with `expandable_segments`:** setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` tells the allocator to create segments that can later be expanded instead of allocating a new fixed-size segment each time, which can sharply reduce external fragmentation when allocation sizes change frequently. The PyTorch documentation marks the option as experimental, but it is one of the most effective fixes for "OOM with free memory."
- **Tune `max_split_size_mb`:** the allocator will not split blocks larger than this size (in MB), which can stop it from chopping a big free region into fragments too small to reuse.
- **Avoid manual `empty_cache()` in hot paths:** calling it constantly returns memory to the driver only to re-acquire it, hurting performance; use it sparingly, e.g., between distinct phases.
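
The allocator options above are passed through the `PYTORCH_CUDA_ALLOC_CONF` environment variable as comma-separated `option:value` pairs, and they need to be in place before PyTorch initializes CUDA. A minimal sketch, assuming you set the variable from Python rather than from the shell; `build_alloc_conf` is a hypothetical helper and the value 512 is an arbitrary example, not a recommendation.

```python
import os
from typing import Optional

def build_alloc_conf(expandable_segments: bool = True,
                     max_split_size_mb: Optional[int] = None) -> str:
    """Build a PYTORCH_CUDA_ALLOC_CONF string from the options discussed above."""
    parts = [f"expandable_segments:{expandable_segments}"]
    if max_split_size_mb is not None:
        parts.append(f"max_split_size_mb:{max_split_size_mb}")
    return ",".join(parts)

# Set the variable before `import torch` (or at least before the first CUDA call),
# otherwise the allocator may already be configured with defaults.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf(
    expandable_segments=True,
    max_split_size_mb=512,  # example value only; tune against your own workload
)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Setting the variable in the launch environment (for example in the shell or the container spec) is usually more robust than setting it in code, because it cannot be bypassed by an earlier import.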



## Fix 2: Use PagedAttention for KV cache

The biggest win for LLM serving is **PagedAttention**, introduced by **vLLM** and now adopted broadly. Inspired by virtual memory paging in operating systems, it stores the KV cache in **fixed-size pages** that need not be contiguous, instead of one large contiguous reservation per request:

- Pages are allocated on demand as a sequence grows, so short requests use few pages and long ones use more. Internal fragmentation shrinks from the unused remainder of a max-length reservation to, at most, the unfilled slots in each sequence's last page.
- Because pages are uniform and non-contiguous, freed pages are immediately reusable by any request — eliminating external fragmentation.
- This lets a GPU pack far more concurrent sequences into the same VRAM, raising throughput substantially.

If you serve LLMs, using an engine with paged KV cache (**vLLM**, **TensorRT-LLM**, **SGLang**, **TGI**) is the most important anti-fragmentation decision you can make.
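
The bookkeeping behind a paged KV cache can be sketched as a free list of page IDs plus a per-sequence page table. This is a toy model of the idea, not vLLM's implementation; `PagedKVCache`, the page size of 16 tokens, and the pool of 8 pages are all assumptions made for illustration.

```python
class PagedKVCache:
    """Toy page-table bookkeeping for a KV cache (no tensors, counts only)."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # any free page fits any request
        self.tables = {}    # seq_id -> list of page ids (need not be contiguous)
        self.lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> bool:
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:      # no page yet, or last page is full
            if not self.free_pages:
                return False                  # pool exhausted
            self.tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        # Finished sequences hand their pages straight back to the pool.
        self.free_pages.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        # Only the unfilled tail of each sequence's last page is wasted.
        return sum(len(pages) * self.page_size - self.lengths[seq]
                   for seq, pages in self.tables.items())


cache = PagedKVCache(num_pages=8, page_size=16)
for _ in range(5):
    cache.append_token("short")   # 5 tokens -> 1 page
for _ in range(40):
    cache.append_token("long")    # 40 tokens -> 3 pages

waste = cache.wasted_slots()
pages_free = len(cache.free_pages)

cache.release("long")             # its 3 pages become reusable immediately
admitted = all(cache.append_token("next") for _ in range(100))
print(waste, pages_free, admitted)
```

Because every page is the same size, the pages released by `long` can serve `next` even though they are scattered across the pool, and per-sequence waste never exceeds one page minus one slot.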

## Fix 3: Keep allocations uniform and pre-plan memory

Fragmentation thrives on variability, so reduce it:

- **Pad or bucket inputs.** Grouping requests into a few sequence-length buckets makes allocations uniform and reusable, instead of a unique size per request.
- **Pre-allocate a memory fraction.** Many engines let you cap the share of VRAM they may use (e.g., vLLM's `gpu_memory_utilization`); whatever remains of that budget after weights and activations becomes the KV-cache pool. Reserving a planned pool up front prevents ad-hoc allocations from fragmenting memory.
- **Use static shapes where possible.** Fixed batch/sequence shapes (and CUDA graphs) avoid re-allocations entirely for the steady state.
- **Watch dtype and quantization.** Lower-precision weights and KV cache (FP8/INT8) shrink allocations, leaving more headroom so fragmentation is less likely to push you over the edge.
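
Bucketing, the first item in the list above, amounts to rounding each sequence length up to the nearest entry in a small fixed set of sizes. A minimal sketch, assuming hypothetical bucket boundaries and request lengths; `bucket_length` is an illustrative helper, not part of any serving engine.

```python
from bisect import bisect_left

BUCKETS = [128, 256, 512, 1024, 2048]  # hypothetical bucket boundaries

def bucket_length(seq_len: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket that can hold seq_len tokens."""
    i = bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence of {seq_len} tokens exceeds the largest bucket")
    return buckets[i]

lengths = [37, 120, 129, 300, 511, 900, 1500]   # example request lengths
padded = [bucket_length(n) for n in lengths]

print("distinct raw sizes:   ", len(set(lengths)))
print("distinct padded sizes:", len(set(padded)))
print("padding overhead:      "
      f"{sum(padded) / sum(lengths) - 1:.0%} extra tokens")
```

The trade-off is explicit: fewer distinct allocation sizes, so freed buffers are more likely to be reusable, in exchange for some padded tokens. Denser buckets reduce padding but bring back size variability.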

## Fix 4: Diagnose before you guess

Don't fight phantom fragmentation — measure it:

- **`nvidia-smi`** shows used vs. free memory per GPU and per process, but it counts memory the framework has cached as used, so it cannot reveal fragmentation inside the allocator.
- **PyTorch memory tools:** `torch.cuda.memory_summary()`, `memory_allocated()` vs. `memory_reserved()` (a persistently large gap between reserved and allocated, together with OOM errors, points to fragmentation), and the **memory snapshot/visualizer** that records allocation history and shows exactly where the gaps are.
- **Engine metrics:** vLLM and others expose KV-cache utilization and the number of running/waiting sequences, telling you whether you are memory-bound.

If `memory_reserved` is much larger than `memory_allocated` and you still OOM, fragmentation — not raw capacity — is your problem, and `expandable_segments` plus paged KV cache are the fixes.
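
The reserved-vs-allocated comparison is simple enough to wrap in a helper and log periodically. A minimal sketch: `allocator_gap` is a hypothetical function that works on plain byte counts, and the sample numbers are made up for illustration.

```python
def allocator_gap(reserved_bytes: int, allocated_bytes: int):
    """Return (gap in bytes, gap as a fraction of reserved memory)."""
    gap = reserved_bytes - allocated_bytes
    ratio = gap / reserved_bytes if reserved_bytes else 0.0
    return gap, ratio

# Made-up sample: the allocator holds 20 GiB but only 12 GiB is in live tensors.
GIB = 2**30
gap, ratio = allocator_gap(reserved_bytes=20 * GIB, allocated_bytes=12 * GIB)
print(f"gap: {gap / GIB:.1f} GiB ({ratio:.0%} of reserved memory is not in use)")
```

With PyTorch, the two inputs would come from `torch.cuda.memory_reserved()` and `torch.cuda.memory_allocated()`. A large ratio alone is not proof of fragmentation, since cached blocks may still be reusable; it becomes a strong signal when it coincides with OOM errors.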

## Putting it together

To avoid GPU memory fragmentation: serve LLMs with a **paged-KV-cache engine** (vLLM/TensorRT-LLM/SGLang/TGI); enable PyTorch's **`expandable_segments`** and tune `max_split_size_mb` for variable workloads; **pre-allocate** a planned memory fraction rather than letting allocations grow ad hoc; **bucket and pad** inputs to keep allocation sizes uniform; lower precision with **quantization** to widen your headroom; and **measure** reserved-vs-allocated and KV utilization so you fix the real bottleneck. Done together, these keep free VRAM contiguous and reusable, letting you pack more work onto each GPU without mysterious out-of-memory failures.

## Frequently Asked Questions

**Why do I get "CUDA out of memory" when nvidia-smi shows free memory?**
Because the free memory is fragmented: it is split into non-contiguous holes too small for the allocation you need, even though the total is sufficient. The framework's caching allocator may also be holding reserved memory it has not handed back. Enabling `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, reducing allocation-size variability, and using a paged KV cache usually resolve it.

**What is the difference between internal and external fragmentation?**
Internal fragmentation is wasted space *inside* an allocation — you reserved more than you use, often because the request was rounded up to a block size or pre-sized for the worst case. External fragmentation is wasted space *between* allocations — free memory exists but is scattered in non-contiguous chunks, so a large contiguous request fails. LLM KV caches can suffer from both.

**How does PagedAttention reduce fragmentation?**
PagedAttention stores the KV cache in fixed-size, non-contiguous pages allocated on demand, like operating-system virtual memory. This removes the need to reserve a large contiguous block sized for the maximum sequence length per request, which cuts internal fragmentation to the unfilled part of each sequence's last page, and it makes freed pages instantly reusable by any request, which eliminates external fragmentation within the KV-cache pool. The result is far higher concurrency on the same GPU.

**Does calling torch.cuda.empty_cache() fix fragmentation?**
Only partially, and it can hurt performance if overused. `empty_cache()` returns cached-but-unused blocks to the driver, which can free up contiguous space, but the framework then has to re-acquire memory, adding latency. Use it sparingly between distinct workload phases. For ongoing variable workloads, `expandable_segments` and paged KV caching are better, structural fixes.

**How do I tell if fragmentation, not capacity, is my problem?**
Compare `torch.cuda.memory_reserved()` (what the allocator holds) with `torch.cuda.memory_allocated()` (what is actually in use). A large gap, combined with OOM errors, indicates fragmentation rather than a genuine shortage of memory. PyTorch's memory snapshot/visualizer shows the allocation timeline and gaps, and engine KV-cache utilization metrics reveal whether the serving layer is the cause.

**Does quantization help with fragmentation?**
Indirectly, yes. Quantizing weights and the KV cache (for example to FP8 or INT8) shrinks each allocation, so you have more free headroom and are less likely to hit the edge where fragmentation causes an OOM. It does not fix the structural cause of fragmentation, but it buys margin and lets you serve more concurrent requests, which is often the practical goal.

## Sources
- NVIDIA — "CUDA memory management and best practices" (docs.nvidia.com/cuda)
- PyTorch — "CUDA semantics: memory management and caching allocator" (pytorch.org/docs)
- PyTorch — "Understanding GPU memory and the memory snapshot/visualizer" (pytorch.org/blog)
- vLLM — "PagedAttention and KV cache management" (docs.vllm.ai)
- Kwon et al. — "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023)
- Hugging Face — "Text Generation Inference memory and paged attention" (huggingface.co/docs/text-generation-inference)
- NVIDIA — "TensorRT-LLM KV cache and paged attention" (github.com/NVIDIA/TensorRT-LLM)
