← Hub
Pulse ← Library ⚡ Hire a Fractional CRO
Pulse Reviews and Analysis

How do you build a self-hosted LLM stack in 2027?

Kory WhiteCurated by Kory White · Fractional CRO, CRO Syndicate
👍 Yup or 👎 Nope — vote this up its category:
📅 Published · Updated · 6 min read
How do you build a self-hosted LLM stack in 2027?

How do you build a self-hosted LLM stack in 2027?

You build a self-hosted LLM stack by assembling five layers on your own hardware or cloud GPUs: an open-weight model (such as a Llama, Mistral, Qwen, or Gemma family model), an inference server to serve it efficiently (vLLM, TGI, or TensorRT-LLM), a retrieval layer for your data (an embedding model plus a vector database), a gateway for routing, caching, auth, and observability, and an orchestration and monitoring layer (Kubernetes or Ray plus Prometheus/Grafana and an LLM observability tool).

The goal is data control, predictable cost, and no per-token API dependency. The right stack scales from a single GPU box for a small team to an autoscaled cluster for thousands of users.

Why self-host at all

Teams self-host LLMs for three main reasons: data privacy and control (sensitive data never leaves your environment), cost predictability (you pay for GPUs, not per token, which favors high, steady volume), and customization (you can fine-tune, quantize, and tune serving to your needs).

The trade-off is operational responsibility — you run the infrastructure that a hosted API would run for you. Self-hosting wins when you have steady volume, strict data requirements, or a need to customize models; hosted APIs win when traffic is low or spiky and you value zero operations.

flowchart TD A[Decide to self-host] --> B{Why?} B -- Data control --> C[Keep data in your environment] B -- Cost at scale --> D[GPUs over per-token API] B -- Customization --> E[Fine-tune + quantize freely] C --> F[Build the stack] D --> F E --> F

Layer 1: Choose an open-weight model

Start by selecting a model that fits your task and hardware. Open-weight families in 2027 include Llama, Mistral / Mixtral, Qwen, Gemma, and DeepSeek, spanning small models that run on a single GPU to large mixture-of-experts models that need multiple GPUs. Pick the smallest model that meets your quality bar — smaller models are cheaper and faster to serve, and you can escalate to a larger one only for hard tasks via routing.

Evaluate candidate models on your own task, not just public benchmarks.

Layer 2: Serve it with an efficient inference server

Raw model weights are not a service; you need an inference server. The leading open-source choices are vLLM (high throughput via PagedAttention and continuous batching, OpenAI-compatible API), Hugging Face TGI, and NVIDIA TensorRT-LLM with Triton for maximum NVIDIA performance.

For local development, Ollama and llama.cpp run quantized models on modest hardware. Apply quantization (GPTQ, AWQ, FP8, or GGUF) to fit larger models on cheaper GPUs and raise throughput.

flowchart LR A[Open-weight model] --> B[Quantize: GPTQ/AWQ/FP8] B --> C{Serving target} C -- Production GPU --> D[vLLM / TGI / TensorRT-LLM] C -- Local / edge --> E[Ollama / llama.cpp] D --> F[OpenAI-compatible endpoint] E --> F
CRO Syndicate — Need a fractional Chief Revenue Officer? CRO Syndicate connects you with vetted fractional and interim revenue leaders. Kory White, Fractional CRO · 25 yrs · $0 to $200M scaled.

Reach Kory White, Fractional CRO: 📅 Book a Quick Call · 💼 Kory on LinkedIn · 🏢 CRO Syndicate

Layer 3: Add retrieval for your data (RAG)

Most useful self-hosted assistants need access to your documents, which means retrieval-augmented generation. This adds two components: an embedding model (open options like the BGE or E5 families, or Nomic embeddings) to turn text into vectors, and a vector database to store and search them.

For self-hosting, Qdrant, Weaviate, Milvus, or pgvector (if you already run Postgres) are strong choices. The retrieval layer chunks your documents, embeds them, and at query time fetches relevant context to ground the model's answers — reducing hallucination and keeping responses current without retraining.

Layer 4: Put a gateway in front

As soon as more than one application calls your models, add an AI gateway for routing, caching, authentication, rate limiting, and observability. Open-source options like LiteLLM give you an OpenAI-compatible proxy in front of your self-hosted models (and any external APIs you still use), with semantic caching to cut redundant work and per-team quotas to control usage.

The gateway is also where you enforce guardrails — PII redaction and content filtering — before requests reach the model.

Layer 5: Orchestrate, scale, and monitor

For anything beyond a single box, you need orchestration and observability:

A reference architecture

Putting the layers together, a production self-hosted stack looks like: clients call an AI gateway (LiteLLM) that handles auth, rate limiting, caching, and routing; the gateway sends generation requests to vLLM replicas serving a quantized open-weight model on GPUs, and retrieval requests to an embedding service + vector database (Qdrant or pgvector); Kubernetes or Ray autoscales the replicas on queue depth; and Prometheus/Grafana plus Langfuse monitor health, quality, and cost.

For a small team this can run on a single multi-GPU server; for thousands of users it spreads across an autoscaled cluster with the same components.

Frequently Asked Questions

What hardware do I need to start? For a small team, a single server with one or more modern GPUs can serve a quantized mid-size open model comfortably. The exact GPU memory you need depends on the model size and quantization; quantizing to INT4 lets larger models fit on cheaper GPUs. Scale to more GPUs and nodes as traffic grows.

Which open-weight model should I pick? Choose the smallest model that meets your quality bar on your own task — common families include Llama, Mistral/Mixtral, Qwen, Gemma, and DeepSeek. Use a small model for most traffic and route hard requests to a larger one. Always evaluate on your data rather than relying on public benchmarks.

Do I need a vector database? If your assistant must answer from your documents, yes — RAG needs an embedding model and a vector database (Qdrant, Weaviate, Milvus, or pgvector). If the model only does open-ended generation without your private data, you can skip retrieval.

How do I keep serving cost down? Use a continuous-batching server (vLLM), quantize models to fit cheaper GPUs, cache repeated and rephrased queries, route simple requests to smaller models, and autoscale on queue depth so idle GPUs spin down. These stack to cut cost substantially.

Is self-hosting cheaper than using an API? At high, steady volume, self-hosting is usually cheaper per token because you pay for GPUs rather than per call. At low or spiky volume, hosted APIs are often cheaper and far less operational work. Estimate your token volume and utilization before deciding.

What about security and guardrails? Enforce auth and rate limiting at the gateway, redact PII and filter content before requests reach the model, keep model and data inside your network, and manage secrets properly. Self-hosting gives you full control of the data path, which is a major reason teams choose it.

Sources

Keep reading
Was this helpful?  
Related in the library
More from the library
pulse-speeches · speechesA Speech for a Hall of Fame Inductionpulse-speeches · speechesA Retirement Speech for a Pastorpulse-speeches · speechesA Eulogy for a Coworker Who Died Youngpulse-speeches · speechesA Graduation Speech for a College Commencementpulse-ai-infrastructure · ai-infrastructureWhat is the difference between vLLM, TGI, and Triton for LLM inference?pulse-speeches · speechesA Eulogy for a Fatherpulse-ai-infrastructure · ai-infrastructureThe 10 Best AI Model Monitoring Tools in 2027pulse-ai-infrastructure · ai-infrastructureThe 10 Best Data Labeling Platforms for AI in 2027pulse-ai-infrastructure · ai-infrastructureWhat is model serving and how is it different from a REST API?pulse-speeches · speechesA Graduation Speech for a Nursing School Pinningpulse-ai-infrastructure · ai-infrastructureHow do you optimize cold-start latency for serverless AI inference?pulse-speeches · speechesA Speech for Welcoming a New Hirepulse-ai-infrastructure · ai-infrastructureWhat is distributed training and when do you need it?