Pulse RevOps · The Machine · Accepted Answer

![model serving versus REST API](https://image.pollinations.ai/prompt/model%20serving%20inference%20GPU%20batching%20REST%20API%20machine%20learning%20deployment%20architecture%20glowing%20blue%20diagram?width=1280&height=720&nologo=true)

# What is model serving and how is it different from a REST API?

### Direct Answer
**Model serving** is the practice of making a trained machine-learning model available to produce predictions (inferences) in production: loading the model into memory, exposing it behind an endpoint, and managing the GPU/CPU resources, batching, scaling, and versioning needed to run it efficiently at low latency. A **REST API** is only the *interface* a client uses to send a request and get a response over HTTP. They are not competitors: model serving is the whole job of running a model in production, and a REST API is often *one of the interfaces* a model server exposes. Serving an ML model adds hard problems that a typical REST service rarely faces, such as multi-gigabyte model loading, GPU memory management, dynamic request batching, token streaming, and autoscaling expensive accelerators. That is why dedicated **model servers** like NVIDIA Triton, vLLM, TorchServe, KServe, BentoML, and Ray Serve exist, and why teams do not simply write a plain Flask app.

## What "model serving" actually means

When you finish training a model, you have a file of weights — useless until something loads it and runs inference on real inputs. **Model serving** is everything required to turn those weights into a reliable, fast, scalable prediction service:

- **Loading** the model (often several gigabytes) into CPU or GPU memory and keeping it warm so you don't pay startup cost on every request.
- **Preprocessing** inputs (tokenization, image resizing, normalization) and **postprocessing** outputs into a usable response.
- **Executing inference** efficiently on the right hardware, with optimizations like quantization, compiled kernels, and batching.
- **Scaling** to handle concurrent traffic, including autoscaling GPUs up and down with load.
- **Operational concerns**: versioning, A/B and canary rollouts, health checks, metrics, and observability.

A model server is the software that does all of this. It exposes an endpoint (HTTP/REST, gRPC, or both), but the endpoint is the small visible tip of a large iceberg of inference machinery underneath.
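
The responsibilities listed above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model server is implemented: `ToyModel`, `load_model`, and `ModelServer` are hypothetical names, and the "model" is a tiny hand-written scorer standing in for real weights.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyModel:
    """Hypothetical stand-in for real weights: a tiny logistic scorer."""
    weights: dict
    bias: float
    version: str

    def predict(self, features: dict) -> float:
        # Weighted sum of token counts, squashed to (0, 1) with a sigmoid.
        z = self.bias + sum(self.weights.get(k, 0.0) * v for k, v in features.items())
        return 1.0 / (1.0 + math.exp(-z))


def load_model() -> ToyModel:
    # A real server would read gigabytes of weights into CPU or GPU memory here.
    return ToyModel(weights={"great": 1.5, "bad": -2.0}, bias=0.0, version="v1")


class ModelServer:
    def __init__(self) -> None:
        self.model = load_model()  # loaded once at startup and kept warm
        self.requests_served = 0   # simplest possible metric

    def preprocess(self, text: str) -> dict:
        # Toy "tokenization": lowercase, split on whitespace, count tokens.
        counts: dict = {}
        for token in text.lower().split():
            counts[token] = counts.get(token, 0.0) + 1.0
        return counts

    def postprocess(self, score: float) -> dict:
        # Turn the raw score into a response a client can use.
        return {
            "label": "positive" if score >= 0.5 else "negative",
            "score": round(score, 3),
            "model_version": self.model.version,
        }

    def handle(self, text: str) -> dict:
        # The full path of one request: preprocess, infer, postprocess.
        self.requests_served += 1
        return self.postprocess(self.model.predict(self.preprocess(text)))

    def health(self) -> dict:
        # Health check plus basic observability.
        return {
            "status": "ok",
            "model_version": self.model.version,
            "requests_served": self.requests_served,
        }
```

Calling `ModelServer().handle("great great movie")` runs the whole path without reloading the model. Everything a production server adds (GPU placement, batching, autoscaling, rollouts) wraps around this same load-once, handle-many shape.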

## What a REST API actually is

**REST** is an architectural style for building APIs over HTTP, and a **REST API** is an interface that follows it. A client sends a request to a URL with a method (GET, POST), headers, and a body (often JSON); the server returns a status code and a response body. REST says nothing about machine learning: the same style of interface serves a weather app, a payments service, or a to-do list.

When you call a hosted model, for example `POST /v1/chat/completions` on OpenAI, `POST /v1/messages` on Anthropic, or a prediction endpoint on AWS SageMaker, you are using a REST API as the *front door*. REST is the contract for how to ask; it is deliberately agnostic about what happens behind it. The model server behind that door could be Triton, vLLM, or a custom stack, and the client neither knows nor cares.
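
To make the "front door" concrete, here is a sketch of a client using only the Python standard library. The URL, the `{"inputs": [...]}` payload shape, and the bearer-token header are assumptions for illustration; a real provider documents its own path, schema, and authentication.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only.
ENDPOINT = "https://models.example.com/v1/predict"


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build the HTTP POST a client would send to a prediction endpoint."""
    body = json.dumps({"inputs": [text]}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def predict(text: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(text, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Nothing in this client reveals which model server, hardware, or batching strategy sits behind the URL, which is the sense in which REST is only the contract for how to ask.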

```mermaid
flowchart LR
    C[Client app] -->|HTTP POST JSON| R[REST / gRPC interface]
    R --> Q[Request queue + dynamic batching]
    Q --> M[Model loaded on GPU]
    M --> P[Postprocess output]
    P --> R
    R -->|HTTP response| C
    subgraph Model Server
      R
      Q
      M
      P
    end
```

## Why you can't just wrap a model in a plain REST API

It is tempting to drop a model into a Flask or FastAPI handler and call it done. That works for a demo and breaks in production, because ML inference has constraints a generic web API never deals with:

- **Expensive, stateful hardware.** A GPU holding a multi-billion-parameter model can't be spun up per request like a stateless web worker. The model must stay loaded; the server must share one GPU across many concurrent requests safely.
- **Dynamic batching.** GPUs are far more efficient when they process many inputs at once. A real model server collects incoming requests for a few milliseconds and runs them as a single batch, trading a small added wait for much higher throughput. A naive one-request-per-handler API has no place to do this.
- **Memory and concurrency limits.** Each in-flight request consumes GPU memory (for LLMs, the KV cache grows with sequence length). The server must admit, queue, or reject requests to avoid out-of-memory crashes — wo
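
The batching and admission behavior described in the list above can be sketched with a queue and one worker thread. This is a simplified illustration under stated assumptions, not the algorithm used by Triton, vLLM, or any other server: `DynamicBatcher`, `batch_fn`, and the default limits are hypothetical, and real servers bound admission by GPU memory rather than by a simple queue length.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Collect requests for a short window, then run them as one batch."""

    def __init__(self, batch_fn, max_batch_size=8, max_wait_s=0.005, max_queue=64):
        self._batch_fn = batch_fn            # takes a list of inputs, returns a list of outputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue(maxsize=max_queue)  # bounded: acts as admission control
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, item) -> Future:
        """Enqueue one request; the returned Future resolves to its result."""
        future = Future()
        try:
            self._queue.put_nowait((item, future))
        except queue.Full:
            # Reject instead of letting the backlog grow without bound.
            future.set_exception(RuntimeError("server overloaded, request rejected"))
        return future

    def _run(self) -> None:
        while True:
            batch = [self._queue.get()]  # block until the first request arrives
            deadline = time.monotonic() + self._max_wait_s
            # Keep filling the batch until it is full or the window closes.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            try:
                results = self._batch_fn([item for item, _ in batch])
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # A failed batch fails every request that was in it.
                for _, future in batch:
                    future.set_exception(exc)
```

A caller would write `batcher = DynamicBatcher(model_fn)` and then `batcher.submit(x).result()` from each request handler, where `model_fn` is whatever function runs a batch through the model. Each caller still sees one request and one response, while the model sees batches of up to `max_batch_size` inputs.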
