How do you version LLM models, prompts, and eval sets in production in 2027?

Question

Pulse RevOps · The Machine · Accepted Answer

## Direct Answer

In 2027, **production LLM versioning** spans three artifacts: (1) **the model itself** (vendor-managed for API models; MLflow + Hugging Face Hub for self-hosted), (2) **the prompt and system message** (Git-versioned alongside code; Promptfoo or LangSmith for review), and (3) **the eval set and golden answers** (Git-versioned; refreshed quarterly). Production deployments **pin specific model versions** (for example `claude-opus-4-7-20260115`, not `claude-opus-latest`) and explicitly version every prompt change. Treat prompts as code: they need PRs, reviews, evals, and rollback.

## 1. Model Version Pinning

**Never use `latest` aliases in production.** A floating alias can be repointed to a newer snapshot without any change on your side, so behavior can shift between two otherwise identical deploys. Pin specific versions:

- **Anthropic:** `claude-opus-4-7-20260115` (date-stamped).
- **OpenAI:** `gpt-5-2026-01-15` (date-stamped); newer models often have explicit version strings.
- **Google:** `gemini-pro-2.5-001` (numbered).
- **Self-hosted Hugging Face models:** pin to specific commit SHA on the Hub.
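
The pinning rule above can be enforced mechanically at startup or in CI. The following is a minimal sketch, not a vendor-endorsed check: the regular expressions, the `org/model@<sha>` notation for Hub commits, and the function names `is_pinned` and `assert_all_pinned` are all assumptions made for illustration, and the patterns should be adjusted to whatever identifier formats your vendors actually publish.

```python
import re

# Hypothetical patterns; adapt them to the identifier formats your vendors publish.
FLOATING_ALIAS = re.compile(r"(^|[-_/:])(latest|stable|preview)$", re.IGNORECASE)
DATE_STAMP = re.compile(r"\d{4}-?\d{2}-?\d{2}$")  # ...-20260115 or ...-2026-01-15
NUMBERED = re.compile(r"-\d{3}$")                 # ...-001
COMMIT_SHA = re.compile(r"@[0-9a-f]{40}$")        # org/model@<40-hex commit> (our own notation)


def is_pinned(model_id: str) -> bool:
    """Return True when the identifier looks like one immutable snapshot."""
    if FLOATING_ALIAS.search(model_id):
        return False
    return any(p.search(model_id) for p in (DATE_STAMP, NUMBERED, COMMIT_SHA))


def assert_all_pinned(model_ids: list[str]) -> None:
    """Raise ValueError naming every identifier that is not pinned."""
    unpinned = [m for m in model_ids if not is_pinned(m)]
    if unpinned:
        raise ValueError(f"Unpinned model identifiers: {unpinned}")
```

Calling `assert_all_pinned` on the model IDs read from configuration turns an accidental `latest` into a failed build instead of a silent behavior change. For self-hosted Hub models, the commit SHA is typically passed through the `revision` argument that `from_pretrained` in `transformers` accepts.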

### 1.1 Vendor Model Deprecation Cadence

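As a rough planning estimate, Anthropic deprecates old Claude versions ~12 months after a new generation, OpenAI ~18 months, and Google ~9 months; confirm actual retirement dates against each vendor's deprecation notices. Build a **model-migration playbook** and review it quarterly.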
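
One small piece of a migration playbook is a check that flags pinned models approaching retirement. This is a sketch under stated assumptions: the function name `models_needing_migration`, the 90-day lead time, and the idea that you maintain retirement dates by hand in a dictionary are all illustrative choices, not something any vendor API provides in this form.

```python
from datetime import date


def models_needing_migration(
    retirement_dates: dict[str, date],
    today: date,
    lead_days: int = 90,  # assumed lead time; pick what your migration process needs
) -> list[str]:
    """Return model IDs retiring within lead_days of today, soonest first.

    Models whose retirement date has already passed are included as well.
    """
    due = [(d, m) for m, d in retirement_dates.items() if (d - today).days <= lead_days]
    return [m for _, m in sorted(due)]
```

Running this in the quarterly review (or in a scheduled job) against the dates taken from vendor deprecation pages gives an ordered migration queue.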

## 2. Prompt Versioning

Treat prompts as code:
- **Git repo for prompts** alongside application code.
- **PR review** for every prompt change.
- **Eval-set run on PR** via Promptfoo, LangSmith, or Braintrust.
- **Tagged releases** matched to deployment versions.
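
The eval-on-PR step can be pictured as a small gate that CI runs against the changed prompt. The sketch below is tool-agnostic and does not reproduce the Promptfoo, LangSmith, or Braintrust APIs: `EvalCase`, `run_eval_gate`, the substring grader, the `{input}` placeholder, and the 95% threshold are all hypothetical, and `call_model` stands in for whatever client your application uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str
    must_contain: str  # simplistic grader; real suites use richer checks


def run_eval_gate(
    prompt_template: str,
    cases: list[EvalCase],
    call_model: Callable[[str], str],
    min_pass_rate: float = 0.95,  # assumed threshold
) -> tuple[bool, float, list[str]]:
    """Run every case through the prompt and return (passed, pass_rate, failed_ids)."""
    if not cases:
        raise ValueError("eval set is empty")
    failed = []
    for case in cases:
        # The template is assumed to contain a single {input} placeholder.
        output = call_model(prompt_template.format(input=case.user_input))
        if case.must_contain.lower() not in output.lower():
            failed.append(case.case_id)
    pass_rate = 1 - len(failed) / len(cases)
    return pass_rate >= min_pass_rate, pass_rate, failed
```

In CI, a `False` first element would fail the check and block the merge, and the list of failed case IDs gives the reviewer something concrete to inspect.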

### 2.1 Prompt Management Platforms

- **Promptfoo** — Git-first; strong eval integration.
- **LangSmith Prompt Hub** — UI for prompt iteration; version tracking.
- **Braintrust Prompts** — UI + Git sync.
- **Helicone Prompts** — proxy-managed.
- **Humanloop** — collaborative prompt iteration.

### 2.2 The Risk of UI-Managed Prompts

UI-managed prompts without Git backing become **shadow prompts** — no PR review, no eval-on-change, no rollback. The 2027 best practice: **Git is the source of truth; UI is a viewer.**
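
One way to keep the UI honest is to compare what it serves against what Git holds. The following is a sketch only: it assumes you can export prompts from your prompt tool into a name-to-text dictionary (the export step itself is not shown and varies by tool), and `prompt_fingerprint` and `find_shadow_prompts` are hypothetical helper names.

```python
import hashlib


def prompt_fingerprint(text: str) -> str:
    """Hash a prompt after normalizing trailing whitespace and line endings."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_shadow_prompts(
    git_prompts: dict[str, str],
    served_prompts: dict[str, str],
) -> dict[str, str]:
    """Map prompt name to a reason for every served prompt that does not match Git."""
    issues = {}
    for name, served in served_prompts.items():
        if name not in git_prompts:
            issues[name] = "not in Git"
        elif prompt_fingerprint(served) != prompt_fingerprint(git_prompts[name]):
            issues[name] = "differs from Git"
    return issues
```

A non-empty result means some prompt reached production without passing through PR review and evals, which is exactly the shadow-prompt condition described above.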

## 3. Eval Set Versioning

Eval sets evolve, and a score is only comparable to another score measured on the same eval-set version. Tag every release of your golden eval set so you can compare model A on eval-set-v3 to model B on eval-set-v3.

- **Git for the eval set** in repo alongside code.
- **Versioned snapshots** when adding examples.
- **Stratified sampling** for incremental additions.
- **Quarterly refresh** with stakeholder review.
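
Two of the practices above, versioned snapshots and stratified additions, are easy to sketch. This is an illustration under assumptions: examples are taken to be JSON-serializable dictionaries with a category field, and `snapshot_id` and `stratified_sample` are hypothetical helpers rather than part of any eval tool.

```python
import hashlib
import json
import random
from collections import defaultdict


def snapshot_id(version: str, examples: list[dict]) -> str:
    """Build an ID like 'eval-set-v3+<digest>' that changes whenever content changes."""
    # Serialize each example canonically, then sort so example order does not matter.
    rows = sorted(json.dumps(ex, sort_keys=True) for ex in examples)
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()[:12]
    return f"eval-set-{version}+{digest}"


def stratified_sample(
    candidates: list[dict], key: str, per_stratum: int, seed: int = 0
) -> list[dict]:
    """Pick up to per_stratum examples from each value of `key`, reproducibly."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for ex in candidates:
        buckets[ex[key]].append(ex)
    picked = []
    for stratum in sorted(buckets):  # sorted so the result is stable across runs
        pool = buckets[stratum]
        picked.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return picked
```

Recording the snapshot ID next to every eval score makes it obvious when two scores were measured on different content, and sampling per category keeps a high-volume category from crowding out rare ones when new examples are added.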

## 4. Rollback Strategy

Every model change, prompt change, or eval-set change needs a **rollback plan**:
- **Canary deployment** — 5% of traffic on new version; monitor metrics; roll back if regression.
- **Feature flags** — LaunchDarkly, Statsig, GrowthBook gate new model versions.
- **A/B testing** — Statsig, Eppo, Optimizely for systematic comparison.
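
The canary split itself needs very little machinery. The sketch below shows one common approach, deterministic hash bucketing, and does not reproduce how LaunchDarkly, Statsig, or GrowthBook implement it; `in_canary`, `choose_model`, and the `salt` value are hypothetical names chosen for the example.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: float, salt: str = "model-rollout") -> bool:
    """Deterministically place a user in the canary cohort.

    The same user always lands in the same bucket, so their experience is stable
    for as long as rollout_percent does not change.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999, i.e. 0.01% steps
    return bucket < rollout_percent * 100


def choose_model(user_id: str, stable: str, candidate: str, rollout_percent: float = 5.0) -> str:
    """Route canary users to the candidate version and everyone else to stable."""
    return candidate if in_canary(user_id, rollout_percent) else stable
```

With this shape, rolling back is a configuration change: setting `rollout_percent` to 0 sends all traffic to the stable version without a redeploy.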

### 4.1 Production Telemetry for Rollback Decisions

Track per-version:
- Latency P50/P95/P99.
- Cost per call.
- Eval-in-production score (LLM-as-judge).
- User-feedback signal (thumbs, follow-up rate).
- Error rate.

Roll back if any metric regresses by more than 5% relative to the previous version and the difference is statistically significant.
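
For rate-style metrics such as error rate or thumbs-down rate, the rollback rule can be sketched as a relative-regression check combined with a one-sided two-proportion z-test. This is one reasonable reading of the rule, not the only one: the function name `should_roll_back`, the choice of test, and `alpha = 0.05` are assumptions, and the test only establishes that the candidate is significantly worse, not that the true gap exceeds 5%. Latency and cost are continuous metrics and would need a different test.

```python
import math


def should_roll_back(
    base_bad: int,
    base_total: int,
    cand_bad: int,
    cand_total: int,
    max_relative_regression: float = 0.05,
    alpha: float = 0.05,  # assumed significance level
) -> bool:
    """Return True when the candidate's bad-outcome rate is more than 5% worse
    (relative) than baseline and the difference is statistically significant."""
    if base_total <= 0 or cand_total <= 0:
        raise ValueError("both versions need at least one observation")
    p_base = base_bad / base_total
    p_cand = cand_bad / cand_total
    # Step 1: is the observed regression larger than the allowed margin?
    if p_cand <= p_base * (1 + max_relative_regression):
        return False
    # Step 2: one-sided two-proportion z-test with a pooled standard error.
    pooled = (base_bad + cand_bad) / (base_total + cand_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cand_total))
    if se == 0:
        return False
    z = (p_cand - p_base) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return p_value < alpha
```

The significance requirement matters most at canary scale: with only a few hundred calls on the new version, a small observed difference is usually noise, and the function returns `False` until enough traffic accumulates.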

```mermaid
flowchart TD
    A[Prompt or Model Change PR] --> B[Eval-on-CI Promptfoo or LangSmith]
    B --> C{Pass Eval?}
    C -->|No| D[Reject PR]
    C -->|Yes| E[Tagged Release]
    E --> F[Canary Deploy 5 Percent Traffic]
    F --> G[Production Telemetry Datadog LangSmith]
    G --> H{Regression?}
    H -->|Yes| I[Rollback to Previous Version]
    H -->|No| J[Scale to 100 Percent]
    I --> K[Triage Issue + Fix]
    J --> L[Quarterly Review]
    K --> A
```

## 5. The Three-Artifact Versioning Matrix

| Artifact | Storage | Versioning | Review |
|---|---|---|---|
| Model | Vendor API (pinned) or HF Hub | Date-stamped version string | Quarterly bake-off |
| Prompt | Git repo | Semver tag | PR with eval-on-CI |
| Eval Set | Git repo | Semver tag + dated snapshots | Quarterly stakeholder |

```mermaid
flowchart LR
    M[Model Version Pinned] --> D[Production Deploy]
    P[Prompt Version Git Tag] --> D
    E[Eval Set Version Git Tag] --> CI[Eval-on-CI]
    CI --> D
    D --> T[Telemetry + Eval-in-Production]
    T --> R{Drift?}
    R -->|Yes| RB[Rollback]
    R -->|No| OK[Quarterly Review]
```
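
The three artifacts in the matrix are easier to roll back together when each deployment records them as a single unit. The following is a minimal sketch of such a record; `ReleaseManifest`, its field names, and `changed_artifacts` are hypothetical, and the version strings in the usage note are placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """One deployable combination of the three versioned artifacts."""
    model_id: str          # pinned vendor string or Hub commit
    prompt_version: str    # Git tag of the prompt repo
    eval_set_version: str  # Git tag of the eval set used to gate this release

    def to_json(self) -> str:
        """Serialize with sorted keys so the manifest diffs cleanly in Git."""
        return json.dumps(asdict(self), sort_keys=True)


def changed_artifacts(previous: ReleaseManifest, current: ReleaseManifest) -> list[str]:
    """Name the artifacts that differ between two releases."""
    before, after = asdict(previous), asdict(current)
    return [name for name in before if before[name] != after[name]]
```

For example, comparing a manifest with `prompt_version="v1.4.0"` to one with `prompt_version="v1.5.0"` and everything else equal yields `["prompt_version"]`, which tells the on-call engineer that a telemetry regression should be traced to the prompt change and that redeploying the previous manifest is the rollback.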

## FAQ

**Should we ever use `latest` model aliases?** Never in production. Pin versions.

**Where do prompts live: Git or a UI tool?** Git as source of truth; UI as viewer. Prompts that exist only in a UI are shadow prompts.

**How often should we refresh the eval set?** Quarterly minimum; sooner if production distribution shifts.

**Canary or A/B test for new model versions?** Canary for rollback safety; A/B for measurable comparison. Many teams do both.

**What's the rollback trigger?** A regression of more than 5% versus the previous version on any tracked metric (latency, cost, eval score, user feedback, error rate), with statistical significance.

## Bottom Line

LLM versioning in 2027 is three artifacts — model, prompt, eval set — each version-controlled, eval-gated, and canary-deployed. Pin model versions. Treat prompts as code. Refresh eval sets quarterly. Build rollback into every deployment.
