Pulse RevOps · The Machine · Accepted Answer

![How do you handle GPU scheduling on Kubernetes for AI workloads?](https://cloudrps.com/images/blog/kubernetes-dra-gpu-ai-workloads/hero.png)

# How do you handle GPU scheduling on Kubernetes for AI workloads?

You handle GPU scheduling on Kubernetes by adding the capabilities the default scheduler lacks: first expose GPUs to the cluster with the **NVIDIA GPU Operator** and device plugin, then layer a **batch scheduler or queue** (Kueue, Volcano, or NVIDIA's KAI Scheduler) that adds fair-share quotas, priorities, preemption, and gang scheduling for distributed jobs. Pair that with a **node autoscaler** (Karpenter or Cluster Autoscaler) to provision GPU nodes on demand and scale to zero when idle, and enable **GPU sharing** (MIG or time-slicing) so multiple workloads can use one accelerator. The goal is high utilization of scarce, expensive GPUs without jobs deadlocking or one team starving another.

## Why the default Kubernetes scheduler isn't enough

Kubernetes treats a GPU as a simple countable resource (`nvidia.com/gpu: 1`) and schedules pods one at a time, filtering and scoring nodes for each pod independently, with no awareness of how AI workloads actually behave. Out of the box it has no concept of **gang scheduling** (starting all workers of a distributed training job together), **fair-share quotas** across teams, **fractional GPU sharing**, or **queue-based admission**. Without help, the result is two classic failure modes. First, a multi-pod training job gets some pods scheduled while others stay pending, so the scheduled pods hold GPUs idle in a deadlock. Second, a single team's jobs monopolize the cluster while everyone else waits indefinitely. Fixing this is what GPU scheduling on Kubernetes is about.

```mermaid
flowchart TD
    A[Default K8s scheduler] --> B[GPU = countable resource only]
    B --> C[No gang scheduling]
    B --> D[No fair-share quotas]
    B --> E[No fractional sharing]
    C --> F[Distributed jobs deadlock]
    D --> G[One team starves others]
    E --> H[Low utilization]
```
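The deadlock described above can be reproduced with a toy model. The sketch below is not the real kube-scheduler; it only assumes that pods from two jobs arrive interleaved and are placed one at a time with no gang awareness. The `Job` class and `schedule_pod_by_pod` function are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int      # pods in the job, one GPU each
    running: int = 0  # pods that currently hold a GPU

    @property
    def can_make_progress(self) -> bool:
        # A distributed training job only trains once every worker is up.
        return self.running == self.workers


def schedule_pod_by_pod(jobs, total_gpus):
    """Place pending pods one at a time, round-robin across jobs.

    This mimics independent pods reaching a scheduler that has no notion
    of which pods belong together. Returns the number of GPUs left free.
    """
    free = total_gpus
    placed_something = True
    while free > 0 and placed_something:
        placed_something = False
        for job in jobs:
            if free > 0 and job.running < job.workers:
                job.running += 1
                free -= 1
                placed_something = True
    return free


jobs = [Job("train-a", workers=4), Job("train-b", workers=4)]
free = schedule_pod_by_pod(jobs, total_gpus=4)

for job in jobs:
    state = "training" if job.can_make_progress else "stalled"
    print(job.name, f"{job.running}/{job.workers} pods running,", state)
print("free GPUs:", free)
```

On a 4-GPU cluster, both 4-worker jobs end up with 2 of 4 pods running: every GPU is allocated, yet neither job can train.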

## Step 1: Expose GPUs with the GPU Operator

Before scheduling anything, the cluster must *see* GPUs. The **NVIDIA GPU Operator** automates the full stack — GPU drivers, the Kubernetes device plugin, the container toolkit, node feature discovery, DCGM monitoring, and MIG configuration — across every GPU node. Once installed, GPU nodes advertise `nvidia.com/gpu` (or MIG profiles), and the device plugin handles allocation and isolation. This is the universal foundation every other layer builds on, and it also gives you the utilization telemetry (via DCGM) you need to tune everything else.
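Once nodes advertise `nvidia.com/gpu`, a workload claims a GPU by listing that resource under its container limits. As a minimal sketch, the helper below builds such a Pod manifest as a Python dictionary and prints it as JSON. The function name `gpu_pod_manifest`, the pod name, and the image `example.com/cuda-base:latest` are placeholders for illustration, and the `nvidia-smi` command assumes that binary is available inside the container.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests whole GPUs via the device plugin."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,  # placeholder image, not a real registry path
                    "command": ["nvidia-smi"],
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                }
            ],
        },
    }


manifest = gpu_pod_manifest("gpu-smoke-test", "example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f -` accepts JSON as well as YAML, the printed output can be piped straight into it as a quick check that a GPU node accepts the pod.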

## Step 2: Add a batch scheduler or queue

This is the core of GPU scheduling. A queueing/batch layer decides *when* jobs run, enforces quotas, and starts distributed jobs atomically:

- **Kueue** — a Kubernetes-SIG project that adds job queueing, quotas, cohort-based fair sharing, priorities, and preemption. It admits jobs only when resources are available, so pods don't get stuck half-scheduled. It is the clean, native, open-source default.
- **Volcano** — a CNCF batch scheduler with strong **gang scheduling**, queues, fair-share, and topology-aware placement for high-performance interconnects. Ideal when distributed training must have all workers start together.
- **NVIDIA KAI Scheduler** — open-sourced from Run:ai technology, bringing AI-aware scheduling (gang, fair-share, fractional GPUs, bin-packing) to open-source users.

Pick one based on your needs: Kueue for clean quota/queue management, Volcano or KAI when gang scheduling and HPC-style placement dominate.

```mermaid
flowchart LR
    A[Job submitted] --> B[Queue: Kueue/Volcano/KAI]
    B --> C{Quota + capacity available?}
    C -->|No| D[Job waits in queue]
    C -->|Yes| E[Admit + gang-schedule all pods]
    E --> F[Pods bound to GPUs]
    D -->|Capacity frees / preempt| E
```
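The admission flow in the diagram can be sketched as a small all-or-nothing queue. This is a toy model, not the algorithm of Kueue, Volcano, or KAI: it assumes strict FIFO ordering and a single pool of identical GPUs, and the `Job` and `GangQueue` names are invented for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int  # total GPUs needed by all pods of the job


class GangQueue:
    """Admit a job only when every one of its pods can get a GPU."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.pending = deque()
        self.running = []

    def submit(self, job: Job) -> None:
        self.pending.append(job)
        self._admit()

    def finish(self, name: str) -> None:
        for job in self.running:
            if job.name == name:
                self.running.remove(job)
                self.used -= job.gpus
                break
        self._admit()

    def _admit(self) -> None:
        # Strict FIFO: the head of the queue is admitted only if the whole
        # job fits. Nothing is ever partially placed.
        while self.pending and self.pending[0].gpus <= self.capacity - self.used:
            job = self.pending.popleft()
            self.used += job.gpus
            self.running.append(job)


queue = GangQueue(capacity=4)
queue.submit(Job("train-a", gpus=4))
queue.submit(Job("train-b", gpus=4))
print("running:", [j.name for j in queue.running])  # train-a holds all 4 GPUs
print("pending:", [j.name for j in queue.pending])  # train-b waits, holding none

queue.finish("train-a")
print("running:", [j.name for j in queue.running])  # train-b is admitted whole
```

The waiting job holds no GPUs while it queues, which is what prevents the half-scheduled deadlock. The trade-off of strict FIFO is head-of-line blocking: a large job at the front can delay smaller jobs behind it that would otherwise fit.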


## Step 3: Set quotas, priorities, and preemption

With a queue in place, model your organization's GPU policy. Define **quotas** per team or namespace so each team gets a guaranteed slice but can **burst** into unused capacity: cohorts and fair-share policies lend idle quota to whoever needs it and hand it back when the owning team submits work. Assign **priorities** so latency-sensitive inference outranks long batch training, and enable **preemption** so high-priority jobs can reclaim GPUs from lower-priority, preemptible ones. Pair preemption with checkpointing so that preempted training resumes cleanly. This policy layer is what turns a shared GPU cluster from a free-for-all into a fair, predictable resource.
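To make the borrow-and-reclaim behavior concrete, here is a minimal sketch of a cohort in which teams may borrow idle quota and an owning team can reclaim it by preempting the lowest-priority borrowed workloads. It is a simplified model under stated assumptions, not the policy engine of any particular scheduler: the `Workload` and `Cohort` classes, the team names, and the rule "preempt only from teams that are over their nominal quota" are all choices made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    team: str
    gpus: int
    priority: int  # higher number means more important


class Cohort:
    def __init__(self, quotas):
        self.quotas = quotas  # team -> guaranteed (nominal) GPUs
        self.running = []

    def used_by(self, team):
        return sum(w.gpus for w in self.running if w.team == team)

    def free(self):
        return sum(self.quotas.values()) - sum(w.gpus for w in self.running)

    def admit(self, wl):
        """Return (admitted, preempted_workloads)."""
        preempted = []
        if wl.gpus > self.free():
            # No idle GPUs left. Reclaim only if the team is within its quota.
            if self.used_by(wl.team) + wl.gpus > self.quotas[wl.team]:
                return False, []
            borrowed = {t: self.used_by(t) - q for t, q in self.quotas.items()}
            freed = self.free()
            # Evict lowest-priority workloads from teams that are borrowing.
            for cand in sorted(self.running, key=lambda w: w.priority):
                if freed >= wl.gpus:
                    break
                if cand.team != wl.team and borrowed[cand.team] > 0:
                    preempted.append(cand)
                    borrowed[cand.team] -= cand.gpus
                    freed += cand.gpus
            if freed < wl.gpus:
                return False, []
            for victim in preempted:
                self.running.remove(victim)
        self.running.append(wl)
        return True, preempted


cohort = Cohort({"research": 4, "inference": 4})
cohort.admit(Workload("train-1", "research", gpus=4, priority=1))
cohort.admit(Workload("train-2", "research", gpus=4, priority=0))  # borrows idle quota
ok, evicted = cohort.admit(Workload("serve-1", "inference", gpus=2, priority=10))
print(ok, [w.name for w in evicted], [w.name for w in cohort.running])
```

Here `train-2` runs on GPUs borrowed from the idle inference quota until `serve-1` arrives, at which point `train-2` is preempted. In a real cluster that preempted job would need a checkpoint to resume from.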

## Step 4: Autoscale GPU nodes

GPUs a
