How do you build a self-hosted LLM stack in 2027?

Question

Pulse RevOps · The Machine · Accepted Answer

![How do you build a self-hosted LLM stack in 2027?](https://onyx.app/images/insights/06-stack-layers.png)

# How do you build a self-hosted LLM stack in 2027?

You build a self-hosted LLM stack by assembling five layers on your own hardware or cloud GPUs: an **open-weight model** (such as a Llama, Mistral, Qwen, or Gemma family model), an **inference server** to serve it efficiently (vLLM, TGI, or TensorRT-LLM), a **retrieval layer** for your data (an embedding model plus a vector database), a **gateway** for routing, caching, auth, and observability, and an **orchestration and monitoring** layer (Kubernetes or Ray plus Prometheus/Grafana and an LLM observability tool). The goal is data control, predictable cost, and no per-token API dependency. The right stack scales from a single GPU box for a small team to an autoscaled cluster for thousands of users.

## Why self-host at all

Teams self-host LLMs for three main reasons: **data privacy and control** (sensitive data never leaves your environment), **cost predictability** (you pay for GPUs, not per token, which favors high, steady volume), and **customization** (you can fine-tune, quantize, and tune serving to your needs). The trade-off is operational responsibility — you run the infrastructure that a hosted API would run for you. Self-hosting wins when you have steady volume, strict data requirements, or a need to customize models; hosted APIs win when traffic is low or spiky and you value zero operations.

```mermaid
flowchart TD
    A[Decide to self-host] --> B{Why?}
    B -- Data control --> C[Keep data in your environment]
    B -- Cost at scale --> D[GPUs over per-token API]
    B -- Customization --> E[Fine-tune + quantize freely]
    C --> F[Build the stack]
    D --> F
    E --> F
```

## Layer 1: Choose an open-weight model

Start by selecting a model that fits your task and hardware. Open-weight families in 2027 include **Llama**, **Mistral / Mixtral**, **Qwen**, **Gemma**, and **DeepSeek**, spanning small models that run on a single GPU to large mixture-of-experts models that need multiple GPUs. Pick the smallest model that meets your quality bar — smaller models are cheaper and faster to serve, and you can escalate to a larger one only for hard tasks via routing. Evaluate candidate models on your own task, not just public benchmarks.

## Layer 2: Serve it with an efficient inference server

Raw model weights are not a service; you need an inference server. The leading open-source choices are **vLLM** (high throughput via PagedAttention and continuous batching, OpenAI-compatible API), **Hugging Face TGI**, and **NVIDIA TensorRT-LLM** with **Triton** for maximum NVIDIA performance. For local development, **Ollama** and **llama.cpp** run quantized models on modest hardware. Apply **quantization** (GPTQ, AWQ, FP8, or GGUF) to fit larger models on cheaper GPUs and raise throughput.

```mermaid
flowchart LR
    A[Open-weight model] --> B[Quantize: GPTQ/AWQ/FP8]
    B --> C{Serving target}
    C -- Production GPU --> D[vLLM / TGI / TensorRT-LLM]
    C -- Local / edge --> E[Ollama / llama.cpp]
    D --> F[OpenAI-compatible endpoint]
    E --> F
```

[![CRO Syndicate — Need a fractional Chief Revenue Officer? CRO Syndicate connects you with vetted fractional and interim revenue leaders. Kory White, Fractional CRO · 25 yrs · $0 to $200M scaled.](https://wsrv.nl/?url=files.catbox.moe/usgv65.png&w=1280&output=webp)](https://calendly.com/korywhiterevops)

**Reach Kory White, Fractional CRO:** [📅 Book a Quick Call](https://calendly.com/korywhiterevops) · [💼 Kory on LinkedIn](https://www.linkedin.com/in/korywhite) · [🏢 CRO Syndicate](https://crosyndicate.com/)

## Layer 3: Add retrieval for your data (RAG)

Most useful self-hosted assistants need access to your documents, which means **retrieval-augmented generation**. This adds two components: an **embedding model** (open options like the BGE or E5 families, or Nomic embeddings) to turn text into vectors, and a **vector database** to store and search them. For self-hosting, **Qdrant**, **Weaviate**, **Milvus**, or **pgvector** (if you already run Postgres) are strong choices. The retrieval layer chunks your documents, embeds them, and at query time fetches relevant context to ground the model's answers — reducing hallucination and keeping responses current without retraining.

## Layer 4: Put a gateway in front

As soon as more than one application calls your models, add an **AI gateway** for routing, caching, authentication, rate limiting, and observability. Open-source options like **LiteLLM** give you an OpenAI-compatible proxy in front of your self-hosted models (and any external APIs you still use), with semantic caching to cut redundant work and per-team quotas to control usage. The gateway is also where you enforce **guardrails** — PII redaction and content filtering — before requests reach the model.

## Layer 5: Orchestrate, scale, and monitor

For anything beyond a single box, you need orches

How do you build a self-hosted LLM stack in 2027?

How do you build a self-hosted LLM stack in 2027?

Why self-host at all

Layer 1: Choose an open-weight model

Layer 2: Serve it with an efficient inference server

Layer 3: Add retrieval for your data (RAG)

Layer 4: Put a gateway in front

Layer 5: Orchestrate, scale, and monitor

A reference architecture

Frequently Asked Questions

Sources

How do you build a self-hosted LLM stack in 2027?

How do you build a self-hosted LLM stack in 2027?

Why self-host at all

Layer 1: Choose an open-weight model

Layer 2: Serve it with an efficient inference server

Layer 3: Add retrieval for your data (RAG)

Layer 4: Put a gateway in front

Layer 5: Orchestrate, scale, and monitor

A reference architecture

Frequently Asked Questions

Sources

What does the score mean?