What is GPU memory fragmentation and how do you avoid it?

Question

Pulse RevOps · The Machine · Accepted Answer

![GPU memory fragmentation in AI workloads](https://image.pollinations.ai/prompt/GPU%20memory%20fragmentation%20VRAM%20allocation%20blocks%20KV%20cache%20paged%20attention%20optimization%20diagram%20glowing%20purple?width=1280&height=720&nologo=true)

# What is GPU memory fragmentation and how do you avoid it?

### Direct Answer
GPU memory fragmentation is VRAM that is wasted because it cannot be used, even though the total free amount looks sufficient. It happens because GPU memory is allocated and freed in differently-sized blocks, leaving **gaps** between live allocations. There are two flavors: **external fragmentation**, where free memory is split into many small non-contiguous chunks so a large allocation fails despite enough total free bytes (the classic `CUDA out of memory. Tried to allocate ...` error that appears while the report still shows memory free), and **internal fragmentation**, where memory you reserved is only partly used (a request rounded up to a larger block, or a pre-allocated buffer sized for the worst case). In LLM serving this is acute because the **KV cache** grows unpredictably with each request's sequence length. You avoid it with the right allocator and serving design: use a **caching/pooled allocator** (PyTorch's caching allocator with `expandable_segments`), use **PagedAttention** (vLLM) so the KV cache is allocated in fixed-size pages instead of one big contiguous block, set sensible **memory pre-allocation fractions**, avoid wildly variable allocation sizes, and empty the cache only when truly necessary. The goal is to keep allocations uniform and pooled so free memory stays usable rather than shattered into unusable gaps.

## What fragmentation actually is

When a program allocates GPU memory, the allocator carves a region out of VRAM. When it frees that region, the space becomes available again — but only as a hole of a specific size in a specific location. Allocate and free many differently-sized tensors over time and the free memory becomes a patchwork of holes.

- **External fragmentation:** the free holes are scattered and non-contiguous. You have, say, 4 GB free total, but no single contiguous 2 GB block, so a 2 GB allocation fails. This is why you can see "out of memory" with gigabytes "free."
- **Internal fragmentation:** memory is reserved but underused. Allocators round requests up to a block-size boundary, and some round up to a power of two; under power-of-two rounding, a 1.1 GB request consumes a 2 GB block, with 0.9 GB locked but unused. Pre-allocating a fixed-size buffer for the largest possible input wastes memory on smaller inputs the same way.
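
To put numbers on the internal-fragmentation bullet, here is a minimal sketch that compares two rounding policies for the same 1.1 GB request. Both helper functions and the 2 MB granule are illustrative assumptions, not the behavior of any particular allocator.

```python
GB = 1024**3
MB = 1024**2


def round_up_pow2(nbytes: int) -> int:
    """Round a request up to the next power of two (one possible policy)."""
    if nbytes <= 1:
        return 1
    return 1 << (nbytes - 1).bit_length()


def round_up_multiple(nbytes: int, granule: int) -> int:
    """Round a request up to the next multiple of a fixed granule."""
    return -(-nbytes // granule) * granule


request = int(1.1 * GB)

# Policy 1: power-of-two rounding (hypothetical allocator).
pow2_block = round_up_pow2(request)
pow2_waste = pow2_block - request

# Policy 2: rounding to a 2 MB granule (assumed granule size).
granule_block = round_up_multiple(request, 2 * MB)
granule_waste = granule_block - request

print(f"power-of-two block: {pow2_block / GB:.2f} GB, wasted {pow2_waste / GB:.2f} GB")
print(f"2 MB-granule block: {granule_block / GB:.4f} GB, wasted {granule_waste / MB:.2f} MB")
```

The coarser the rounding, the more reserved-but-unused memory each allocation carries, which is why the rounding policy matters for large tensors.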

```mermaid
flowchart TD
    V[Total VRAM] --> A[Allocate A]
    A --> B[Allocate B]
    B --> FA[Free A leaves a hole]
    FA --> C[Allocate large C]
    C --> X{Fits in any single hole?}
    X -->|No, only scattered gaps| OOM[OOM despite free bytes]
    X -->|Yes| OK[Success]
```
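
The sequence in the diagram can be reproduced with a toy first-fit allocator. This is a simplified model for illustration only (sizes in whole GB, hypothetical tensor names, no coalescing logic beyond what falls out of the hole scan); real GPU allocators are considerably more involved.

```python
def free_holes(total, live):
    """Return (offset, size) for every free gap.

    `live` maps a name to its (offset, size) allocation.
    """
    holes, cursor = [], 0
    for offset, size in sorted(live.values()):
        if offset > cursor:
            holes.append((cursor, offset - cursor))
        cursor = offset + size
    if cursor < total:
        holes.append((cursor, total - cursor))
    return holes


def first_fit(total, live, name, size):
    """Place `name` in the first hole that is large enough; report success."""
    for offset, hole_size in free_holes(total, live):
        if hole_size >= size:
            live[name] = (offset, size)
            return True
    return False


TOTAL_GB = 8
live = {}

# Fill the device with eight 1 GB tensors.
for i in range(8):
    assert first_fit(TOTAL_GB, live, f"t{i}", 1)

# Free every other tensor: 4 GB comes back, but as four separate 1 GB holes.
for i in (0, 2, 4, 6):
    del live[f"t{i}"]

holes = free_holes(TOTAL_GB, live)
total_free = sum(size for _, size in holes)
largest_hole = max(size for _, size in holes)
fits = first_fit(TOTAL_GB, live, "big", 2)

print(f"total free: {total_free} GB, largest hole: {largest_hole} GB, 2 GB fits: {fits}")
```

The request fails because an allocation needs one contiguous hole, and the largest hole here is 1 GB even though 4 GB is free in total.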

## Why LLM serving makes it worse

For training with fixed batch and sequence sizes, allocations are fairly uniform and fragmentation is manageable. **LLM inference is different** because the **KV cache** — the per-request memory holding attention keys and values — grows as the model generates tokens, and every request has a different, unpredictable length.

A naive server reserves a **contiguous** KV-cache block sized for the maximum sequence length for every request. That causes massive **internal** fragmentation (most requests are far shorter than the max) and **external** fragmentation (requests start and finish at different times, leaving variable holes). The vLLM authors reported that existing serving systems wasted roughly 60% to 80% of KV-cache memory to fragmentation and over-reservation, which directly limited how many concurrent requests a GPU could serve.

```mermaid
flowchart LR
    R1[Req 1 short] --> KV[KV cache pool]
    R2[Req 2 long] --> KV
    R3[Req 3 medium] --> KV
    KV --> PAG[PagedAttention: fixed pages, no contiguous reservation]
    PAG --> EFF[Near-zero KV waste, more concurrency]
```
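
A rough way to see the difference between the two strategies is to count reserved token slots. The sketch below assumes three hypothetical request lengths, a 2048-token maximum, and a 16-token page; it models only slot accounting, not vLLM's actual block manager.

```python
import math


def reserved_contiguous(lengths, max_len):
    """Naive scheme: every request reserves max_len token slots up front."""
    return max_len * len(lengths)


def reserved_paged(lengths, page_size):
    """Paged scheme: each request holds only the whole pages it has touched."""
    return sum(math.ceil(n / page_size) * page_size for n in lengths)


# Hypothetical short, long, and medium requests (tokens generated so far).
lengths = [120, 1900, 640]
MAX_LEN = 2048  # assumed maximum sequence length
PAGE = 16       # assumed page size in tokens

used = sum(lengths)
contig = reserved_contiguous(lengths, MAX_LEN)
paged = reserved_paged(lengths, PAGE)

print(f"used slots:       {used}")
print(f"contiguous:       {contig} reserved, {100 * (contig - used) / contig:.1f}% wasted")
print(f"paged (16/page):  {paged} reserved, {100 * (paged - used) / paged:.1f}% wasted")
```

With pages, waste is bounded by at most one partially filled page per request, and because pages need not be adjacent, a finished request's pages can be reused by any other request.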

## Fix 1: Use a caching/pooled allocator

Frameworks do not call the driver for every tensor — they use a **caching allocator** that reserves big slabs of VRAM and sub-allocates from a pool, reusing freed blocks instead of returning them to the driver. This already reduces fragmentation, but you can tune it:

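- **PyTorch caching allocator with `expandable_segments`:** setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` lets the allocator grow an existing segment in place instead of carving out a new fixed-size one each time, which reduces external fragmentation when allocation sizes change often (for example, varying batch sizes). PyTorch documents the option as experimental, but it is one of the most effective fixes for "OOM with free memory."
- **Tune `max_split_size_mb`:** blocks larger than this threshold are never split, so large free regions stay intact for later large requests instead of being chopped into unusable fragments.
- **Avoid manual `torch.cuda.empty_cache()` in hot paths:** it releases unused cached blocks back to the driver, so calling it constantly forces the allocator to re-acquire the same memory and hurts performance. It also cannot free memory still held by live tensors. Use it sparingly, e.g., between distinct phases.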
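
As a minimal sketch of the settings above, the allocator options can be set from Python before PyTorch initializes CUDA. The `build_alloc_conf` helper is a hypothetical convenience, not a PyTorch API; the PyTorch import is guarded so the snippet also runs on machines without PyTorch or without a GPU.

```python
import os


def build_alloc_conf(**options):
    """Join options into the comma-separated `key:value` string PyTorch reads."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# The variable must be set before the first CUDA allocation;
# the simplest way is to set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf(expandable_segments=True)

try:
    import torch
except ImportError:
    torch = None  # PyTorch not installed; the rest of the sketch is skipped

if torch is not None and torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")
    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by the caching allocator
    print(f"allocated={allocated / 2**20:.1f} MiB, reserved={reserved / 2**20:.1f} MiB")
```

A large, persistent gap between reserved and allocated memory is a hint that cached blocks are not being reused well. The same helper can produce other settings, for example `build_alloc_conf(max_split_size_mb=128)`, where 128 is only an illustrative value to be tuned per workload.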
