What is model serving and how is it different from a REST API?
What is model serving and how is it different from a REST API?
Direct Answer
Model serving is the practice of taking a trained machine-learning model and making it available to produce predictions (inferences) in production — loading the model into memory, exposing it behind an endpoint, and managing the GPU/CPU resources, batching, scaling, and versioning needed to run it efficiently at low latency.
A REST API is just the *communication protocol* a client uses to send a request and get a response over HTTP. They are not competitors: model serving is the whole job of running a model in production, and a REST API is often *one of the interfaces* a model server exposes. The difference is that serving an ML model adds hard problems a generic REST API never faces — multi-gigabyte model loading, GPU memory management, dynamic request batching, token streaming, and autoscaling expensive accelerators — which is why dedicated model servers like NVIDIA Triton, vLLM, TorchServe, KServe, BentoML, and Ray Serve exist instead of teams just writing a plain Flask app.
What "model serving" actually means
When you finish training a model, you have a file of weights — useless until something loads it and runs inference on real inputs. Model serving is everything required to turn those weights into a reliable, fast, scalable prediction service:
- Loading the model (often several gigabytes) into CPU or GPU memory and keeping it warm so you don't pay startup cost on every request.
- Preprocessing inputs (tokenization, image resizing, normalization) and postprocessing outputs into a usable response.
- Executing inference efficiently on the right hardware, with optimizations like quantization, compiled kernels, and batching.
- Scaling to handle concurrent traffic, including autoscaling GPUs up and down with load.
- Operational concerns: versioning, A/B and canary rollouts, health checks, metrics, and observability.
A model server is the software that does all of this. It exposes an endpoint (HTTP/REST, gRPC, or both), but the endpoint is the small visible tip of a large iceberg of inference machinery underneath.
What a REST API actually is
A REST API is an architectural style for HTTP communication. A client sends a request to a URL with a method (GET, POST), a body (often JSON), and headers; the server returns a status code and a response body. REST says nothing about machine learning — it is the same protocol used by a weather app, a payments service, or a to-do list.
When you call a hosted model — for example POST /v1/chat/completions to OpenAI or Anthropic, or a prediction endpoint on AWS SageMaker — you are using a REST API as the *front door*. REST is the contract for how to ask; it is deliberately agnostic about what happens behind it. The model server behind that door could be Triton, vLLM, or a custom stack — the client neither knows nor cares.

Reach Kory White, Fractional CRO: 📅 Book a Quick Call · 💼 Kory on LinkedIn · 🏢 CRO Syndicate
Why you can't just wrap a model in a plain REST API
It is tempting to drop a model into a Flask or FastAPI handler and call it done. That works for a demo and breaks in production, because ML inference has constraints a generic web API never deals with:
- Expensive, stateful hardware. A GPU holding a multi-billion-parameter model can't be spun up per request like a stateless web worker. The model must stay loaded; the server must share one GPU across many concurrent requests safely.
- Dynamic batching. GPUs are far more efficient processing many inputs at once. A real model server collects incoming requests for a few milliseconds and runs them as a batch, dramatically increasing throughput — something a naive request-per-handler API can't do.
- Memory and concurrency limits. Each in-flight request consumes GPU memory (for LLMs, the KV cache grows with sequence length). The server must admit, queue, or reject requests to avoid out-of-memory crashes — work a plain REST handler ignores until it falls over.
- Streaming. LLMs generate token by token. Serving frameworks support server-sent events or streaming responses so users see output as it's produced; a simple request/response API returns only after the full generation finishes.
- Multi-model and versioning. Production systems run many models and versions concurrently, with canary rollouts and instant rollback — orchestration a hand-rolled API would have to reinvent.
This is exactly the gap that dedicated serving frameworks fill.
The tools that do model serving
- vLLM is the leading open-source engine for serving LLMs, famous for PagedAttention and continuous batching that maximize GPU throughput; it exposes an OpenAI-compatible REST API.
- NVIDIA Triton Inference Server serves models from any framework (PyTorch, TensorFlow, ONNX, TensorRT) with dynamic batching, concurrent model execution, and HTTP/gRPC endpoints.
- Hugging Face Text Generation Inference (TGI) is a production LLM server with streaming and tensor parallelism.
- TorchServe is PyTorch's native serving framework; TensorFlow Serving is the TensorFlow equivalent.
- BentoML packages models into portable "Bentos" and serves them with autoscaling, popular for its Python-first developer experience.
- KServe brings serverless, autoscaling model serving to Kubernetes with a standard inference protocol.
- Ray Serve offers scalable, framework-agnostic Python serving that composes multiple models into pipelines.
In every case, the framework handles the hard inference machinery and then *exposes* a REST (and usually gRPC) endpoint — confirming the relationship: serving is the engine, REST is the doorway.
How they fit together in practice
A typical production path looks like this: a client app makes a REST call (POST /generate with JSON) → a load balancer routes it → a model server (say vLLM) receives it, places it in a queue, and dynamically batches it with other in-flight requests → the GPU runs inference → tokens stream back through the REST/SSE response to the client.
The client only ever saw a REST API; everything that made the model fast, safe, and scalable was the serving layer behind it.
So when someone asks "should I use model serving or a REST API?" the framing is off. You use a model server to run the model well, and that server speaks REST (and often gRPC) so clients can reach it. Choosing a real serving framework over a hand-rolled Flask endpoint is the decision that actually matters for latency, throughput, and cost.
Frequently Asked Questions
Is a model-serving endpoint always a REST API? Not always. Many serving frameworks offer gRPC in addition to or instead of REST, because gRPC's binary protocol and streaming can be more efficient for high-throughput, low-latency inference. Triton, TorchServe, and others expose both.
REST is the most common public-facing choice because it's universally supported and easy to call.
Can I just use FastAPI to serve my model? For a low-traffic internal tool or prototype, yes. For production at scale — especially LLMs — you'll quickly need dynamic batching, GPU memory management, streaming, and autoscaling that a plain FastAPI handler doesn't provide. Teams often *wrap* a real serving engine (vLLM, Triton) with a thin FastAPI/gateway layer for custom auth or routing, but the heavy lifting stays in the serving engine.
What is dynamic batching and why does it matter? Dynamic batching collects multiple incoming requests over a short window and runs them through the GPU together. Because GPUs are massively parallel, processing a batch is far more efficient than one request at a time, so batching can multiply throughput several times over with only a small latency cost.
It's a core reason dedicated model servers outperform naive APIs.
How does model serving handle multiple models or versions? Serving frameworks support hosting many models and multiple versions of each simultaneously, with traffic-splitting for canary and A/B rollouts and instant rollback to a previous version. Triton's model repository and KServe's inference services are built around this, letting you update models without downtime.
Where does model serving run — my servers or a managed cloud? Both. You can self-host serving frameworks (vLLM, Triton, KServe) on your own GPUs or Kubernetes cluster, or use managed services like AWS SageMaker, Google Vertex AI, or Azure ML endpoints that run the serving layer for you and expose a REST endpoint.
Managed options trade some control for less operational overhead.
Does model serving apply to non-LLM models too? Yes. Model serving covers any model type — image classifiers, recommenders, fraud detectors, embedding models, classical ML. The LLM-specific concerns (KV cache, token streaming) are the most demanding, but batching, scaling, versioning, and a serving endpoint matter for every production model.
Triton and BentoML, for instance, serve all model types.
Sources
- NVIDIA Triton Inference Server documentation (docs.nvidia.com/deeplearning/triton-inference-server)
- VLLM documentation — PagedAttention and OpenAI-compatible server (docs.vllm.ai)
- Hugging Face Text Generation Inference documentation (huggingface.co/docs/text-generation-inference)
- PyTorch TorchServe documentation (pytorch.org/serve)
- KServe documentation — model inference on Kubernetes (kserve.github.io/website)
- BentoML documentation — model serving framework (docs.bentoml.com)
- Ray Serve documentation (docs.ray.io/en/latest/serve)
- MDN Web Docs — REST and HTTP fundamentals (developer.mozilla.org)
