How do you version LLM models, prompts, and eval sets in production in 2027?
Direct Answer
In 2027, production LLM model versioning spans three artifacts: (1) the model itself (vendor-managed for API models; MLflow + Hugging Face Hub for self-hosted), (2) the prompt and system message (Git-versioned alongside code; Promptfoo or LangSmith for review), and (3) the eval set + golden answers (Git-versioned; refreshed quarterly).
Production deployments pin specific model versions (Claude claude-opus-4-7-20260115, not claude-opus-latest) and explicitly version every prompt change. Treat prompts as code — they need PRs, reviews, evals, and rollback.
1. Model Version Pinning
Never use latest aliases in production. Vendors push silent model updates that change behavior. Pin specific versions:
- Anthropic:
claude-opus-4-7-20260115(date-stamped). - OpenAI:
gpt-5-2026-01-15(date-stamped); newer models often have explicit version strings. - Google:
gemini-pro-2.5-001(numbered). - Self-hosted Hugging Face models: pin to specific commit SHA on the Hub.
1.1 Vendor Model Deprecation Cadence
Anthropic deprecates old Claude versions ~12 months after a new generation; OpenAI ~18 months; Google ~9 months. Build a model-migration playbook with quarterly review.
2. Prompt Versioning
Treat prompts as code:
- Git repo for prompts alongside application code.
- PR review for every prompt change.
- Eval-set run on PR via Promptfoo, LangSmith, or Braintrust.
- Tagged releases matched to deployment versions.
2.1 Prompt Management Platforms
- Promptfoo — Git-first; strong eval integration.
- LangSmith Prompt Hub — UI for prompt iteration; version tracking.
- Braintrust Prompts — UI + Git sync.
- Helicone Prompts — proxy-managed.
- Humanloop — collaborative prompt iteration.
2.2 The Risk of UI-Managed Prompts
UI-managed prompts without Git backing become shadow prompts — no PR review, no eval-on-change, no rollback. The 2027 best practice: Git is the source of truth; UI is a viewer.
3. Eval Set Versioning
Eval sets evolve. Tag every release of your golden eval set so you can compare model A on eval-set-v3 to model B on eval-set-v3.
- Git for the eval set in repo alongside code.
- Versioned snapshots when adding examples.
- Stratified sampling for incremental additions.
- Quarterly refresh with stakeholder review.
4. Rollback Strategy
Every model change, prompt change, or eval-set change needs a rollback plan:
- Canary deployment — 5% of traffic on new version; monitor metrics; roll back if regression.
- Feature flags — LaunchDarkly, Statsig, GrowthBook gate new model versions.
- A/B testing — Statsig, Eppo, Optimizely for systematic comparison.
4.1 Production Telemetry for Rollback Decisions
Track per-version:
- Latency P50/P95/P99.
- Cost per call.
- Eval-in-production score (LLM-as-judge).
- User-feedback signal (thumbs, follow-up rate).
- Error rate.
Roll back if any metric regresses >5% with statistical significance.
5. The Three-Artifact Versioning Matrix
| Artifact | Storage | Versioning | Review |
|---|---|---|---|
| Model | Vendor API (pinned) or HF Hub | Date-stamped version string | Quarterly bake-off |
| Prompt | Git repo | Semver tag | PR with eval-on-CI |
| Eval Set | Git repo | Semver tag + dated snapshots | Quarterly stakeholder |
FAQ
Should we ever use latest model aliases? Never in production. Pin versions.
Where do prompts live — Git or a UI tool? Git as source of truth; UI as viewer. UI-only is shadow code.
How often should we refresh the eval set? Quarterly minimum; sooner if production distribution shifts.
Canary or A/B test for new model versions? Canary for rollback safety; A/B for measurable comparison. Many teams do both.
What's the rollback trigger? >5% regression on any tracked metric (latency, cost, eval score, user feedback) with statistical significance.
Bottom Line
LLM versioning in 2027 is three artifacts — model, prompt, eval set — each version-controlled, eval-gated, and canary-deployed. Pin model versions. Treat prompts as code. Refresh eval sets quarterly. Build rollback into every deployment. The teams that skip versioning rediscover the same regression bug every quarter.
Sources
- Anthropic — Claude API Versioning Documentation
- OpenAI — Model Versioning and Deprecation Documentation
- Google — Gemini Model Versioning Reference
- Hugging Face — Hub Model Versioning Reference
- Promptfoo — Git-First Prompt Management Documentation
- LangChain — LangSmith Prompt Hub Reference
- Braintrust — Prompt Versioning Reference
- Statsig — Feature Flags and Experimentation Reference
- LaunchDarkly — AI Configurations Reference
- ESG — LLM Production Operations Survey (2026)