How do you version datasets and models for reproducibility?

Curated by Kory White · Fractional CRO, CRO Syndicate

Direct Answer

You version datasets and models for reproducibility by treating every input and output of training as an immutable, identified artifact and recording the exact links between them. In practice that means: version your code in Git; version your data with a tool like DVC, LakeFS, Pachyderm, or a Delta/Iceberg lakehouse that gives each dataset state a content hash or commit; track every training run with an experiment tracker (MLflow, Weights & Biases) that captures parameters, metrics, code commit, and data version; and register the resulting model in a model registry with its full lineage.

The goal is that for any model in production you can answer "exactly which code, which data, and which configuration produced this?" and re-run it to get the same result. Pinning dependencies and random seeds, plus capturing the environment (Docker, lockfiles), closes the last gaps so a result is truly reproducible rather than merely tracked.

Why reproducibility is hard in ML

In normal software, versioning code in Git is usually enough — same code, same output. ML breaks that assumption because a model is a function of three things that all change independently: the code, the data, and the configuration/environment (hyperparameters, library versions, hardware, random seeds).

Change any one and the model changes. Worse, datasets are often large binary files that don't belong in Git, and training has nondeterministic elements (GPU operations, shuffling) that can vary run to run.

So reproducibility requires versioning all three dimensions and recording how they combine for each run. Skipping any one leaves a gap where "it worked last month" becomes unrecoverable.
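One way to make "how they combine" concrete is to derive a single identifier from all three inputs. The following is a minimal sketch, not any tool's actual scheme: `run_fingerprint` is a hypothetical helper, and the commit, data version, and hyperparameter values are placeholders.

```python
import hashlib
import json


def run_fingerprint(code_commit: str, data_version: str, config: dict) -> str:
    """Return a short, deterministic ID for one (code, data, config) combination."""
    record = {
        "code_commit": code_commit,
        "data_version": data_version,
        "config": config,
    }
    # sort_keys makes the JSON canonical, so dict key order cannot change the hash.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
base = run_fingerprint("a1b9f2", "d3", {"lr": 0.001, "seed": 42})
same = run_fingerprint("a1b9f2", "d3", {"seed": 42, "lr": 0.001})
new_data = run_fingerprint("a1b9f2", "d4", {"lr": 0.001, "seed": 42})
new_seed = run_fingerprint("a1b9f2", "d3", {"lr": 0.001, "seed": 43})

print(base == same, base == new_data, base == new_seed)  # True False False
```

Changing the data version or the seed yields a different fingerprint, while reordering the config keys does not, which is the property a run identifier needs.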

flowchart LR
    Code[Code - Git commit] --> Run[Training run]
    Data[Data - DVC/LakeFS version] --> Run
    Config[Config + env - params, seeds, Docker] --> Run
    Run --> Track[Experiment tracker logs all three]
    Track --> Model[Model artifact + version]
    Model --> Reg[Model registry with lineage]

Versioning datasets

Datasets are usually too large for Git, so you version them with purpose-built tools that store the heavy bytes in object storage (S3, GCS) while keeping lightweight, versioned pointers next to your code. Options include DVC, LakeFS, Pachyderm, and Delta or Iceberg lakehouse tables.

The principle is the same across tools: each dataset state has an immutable identifier (content hash, commit, or version number) that you record with the run, so the exact bytes can be recovered later.
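To illustrate the content-hash idea, here is a small sketch that derives one identifier from every file in a dataset directory. It uses only the standard library and is not the hashing scheme of DVC, LakeFS, or any other tool; `file_sha256` and `dataset_version` are hypothetical helpers, and the CSV contents are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_version(root: Path) -> str:
    """Combine every file under root, in sorted order, into one identifier."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renaming a file changes the version too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(file_sha256(path).encode("ascii"))
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train.csv").write_text("x,y\n1,2\n")
    v1 = dataset_version(root)
    v1_again = dataset_version(root)
    (root / "train.csv").write_text("x,y\n1,3\n")  # change a single value
    v2 = dataset_version(root)

print(v1 == v1_again, v1 == v2)  # True False
```

The same bytes always produce the same identifier, and a one-character edit produces a new one, so recording the identifier with a run pins the exact dataset state.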

Tracking experiments and runs

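Versioning artifacts isn't enough; you must capture the *act of training* that connects them. Experiment trackers such as MLflow and Weights & Biases do this by logging each run's parameters, metrics, code commit, and data version.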

A good tracker turns reproducibility into a lookup: open the run, see the exact code commit, data version, hyperparameters, environment, and metrics — everything needed to reconstruct it.
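The "lookup" idea can be sketched without any particular tracker. The example below is a stand-in built on a JSON-lines file, not the MLflow or W&B API; `log_run` and `find_run` are hypothetical helpers, and the parameter and metric values are placeholders.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_run(log_path: Path, *, code_commit: str, data_version: str,
            params: dict, metrics: dict) -> str:
    """Append one training-run record to a JSON-lines file and return its ID."""
    run_id = uuid.uuid4().hex[:8]
    record = {
        "run_id": run_id,
        "logged_at": time.time(),
        "code_commit": code_commit,
        "data_version": data_version,
        "params": params,
        "metrics": metrics,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return run_id


def find_run(log_path: Path, run_id: str) -> dict:
    """Look a run up by ID; raises KeyError if it was never logged."""
    with log_path.open(encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            if record["run_id"] == run_id:
                return record
    raise KeyError(run_id)


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "runs.jsonl"
    # Placeholder values for illustration only.
    run_id = log_run(log_path, code_commit="a1b9f2", data_version="d3",
                     params={"lr": 0.001, "seed": 42},
                     metrics={"val_acc": 0.91})
    found = find_run(log_path, run_id)

print(found["code_commit"], found["data_version"], found["params"]["seed"])
```

A real tracker adds a UI, artifact storage, and comparison views, but the core record is the same: one entry per run that names the code, the data, and the configuration.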

Versioning models and their lineage

A trained model is an artifact too, and it needs its own versioned home with provenance. A model registry stores each model version alongside metadata linking it back to the run, code, and data that produced it, plus its deployment stage.

The non-negotiable is lineage: from a deployed model, you should be able to trace back to its training run, its dataset version, and its code commit. That chain is what lets you audit, debug, reproduce, or roll back a model with confidence.
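The lineage chain can be pictured as two lookups: model version to training run, then run to its inputs. This is an in-memory sketch, not a real registry API; the `Run` and `ModelVersion` classes, the `churn` model name, and the Docker tag are hypothetical, and the identifiers are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    code_commit: str
    data_version: str
    docker_image: str


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    run_id: str
    stage: str


# Hypothetical registry contents for illustration only.
RUNS = {"1423": Run("1423", "a1b9f2", "d3", "trainer:pinned")}
MODELS = {("churn", 7): ModelVersion("churn", 7, "1423", "production")}


def trace_lineage(name: str, version: int) -> dict:
    """Walk from a registered model back to the run and inputs that built it."""
    model = MODELS[(name, version)]
    run = RUNS[model.run_id]
    return {
        "model": f"{model.name} v{model.version}",
        "stage": model.stage,
        "run_id": run.run_id,
        "code_commit": run.code_commit,
        "data_version": run.data_version,
        "docker_image": run.docker_image,
    }


lineage = trace_lineage("churn", 7)
print(lineage["run_id"], lineage["code_commit"], lineage["data_version"])
# 1423 a1b9f2 d3
```

If either lookup fails with a `KeyError`, the chain is broken, which is exactly the gap a registry with enforced lineage is meant to prevent.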

flowchart TD
    A[Deployed model v7] --> B[Training run #1423]
    B --> C[Code commit a1b9f2]
    B --> D[Dataset version d3 - hash 9f8e]
    B --> E[Params + seed + Docker image]
    C --> F[Reproduce identical model]
    D --> F
    E --> F

Pinning the environment and randomness

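Even with code, data, and config versioned, two more things can break reproducibility: the software environment (library versions and system dependencies, pinned with Docker images and lockfiles) and randomness (seeds for shuffling and initialization, plus GPU operations that can be nondeterministic).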

Capturing these alongside the run closes the last gaps between "tracked" and "truly reproducible."
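A sketch of what "capturing these" might look like using only the standard library follows. `seed_everything` and `capture_environment` are hypothetical helpers; a real project would also seed NumPy and its ML framework, shown here only as comments because this sketch assumes neither is installed.

```python
import platform
import random
import sys
from importlib import metadata


def seed_everything(seed: int) -> None:
    """Seed Python's RNG; extend with NumPy and framework seeds in a real project."""
    random.seed(seed)
    # With those libraries installed you would also call, for example:
    #   numpy.random.seed(seed)
    #   torch.manual_seed(seed)


def capture_environment(packages: list[str]) -> dict:
    """Record interpreter, platform, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]

env = capture_environment(["numpy", "torch"])
print(first == second)  # True
print(env["python"], env["packages"])
```

Logging the returned dictionary with the run gives you something to compare against when a later re-run disagrees; a Docker image digest or lockfile hash would be a stronger record of the same information.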

Putting it together: a reproducible workflow

A practical, end-to-end setup looks like this: store code in Git; version datasets with DVC (or Delta/Iceberg time travel for warehouse data); wrap training in a pipeline (e.g., orchestrated by Dagster or Airflow) that pins the Docker environment and seeds; log every run to MLflow or W&B, capturing parameters, metrics, the Git commit, and the data version; and register the output in a model registry with full lineage.

With that in place, reproducing any model is mechanical: check out its commit, restore its data version, rebuild its environment, and re-run. With seeds pinned you should land in the same place, apart from the small variance that nondeterministic GPU operations can introduce.
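Before re-running, it can help to confirm that the restored inputs really match what the original run recorded. This is a minimal sketch under assumed field names; `preflight_mismatches` is a hypothetical helper and the values are placeholders.

```python
def preflight_mismatches(recorded: dict, current: dict) -> list[str]:
    """Return the names of inputs whose current value differs from the logged run."""
    keys = ("code_commit", "data_version", "docker_image", "seed")
    return [key for key in keys if recorded.get(key) != current.get(key)]


# Placeholder values for illustration only.
recorded = {"code_commit": "a1b9f2", "data_version": "d3",
            "docker_image": "trainer:pinned", "seed": 42}
current = {"code_commit": "a1b9f2", "data_version": "d4",
           "docker_image": "trainer:pinned", "seed": 42}

mismatches = preflight_mismatches(recorded, current)
print(mismatches)  # ['data_version']
```

An empty list means the re-run starts from the same code, data, environment, and seed as the original; anything else names the input to restore first.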

This discipline pays off well beyond reproducibility: it enables audits and compliance, fast rollback when a new model misbehaves, fair comparison between experiments, and confident collaboration across a team.

Frequently Asked Questions

Why can't I just store datasets in Git? Git is built for text and small files; large binary datasets bloat the repo and make it slow or unusable. Tools like DVC and LakeFS keep the heavy data in object storage while committing only a small, versioned pointer (a content hash) to Git, giving you Git-like history without the bloat. Git LFS exists but scales poorly for large ML datasets.

What's the difference between experiment tracking and data versioning? Data versioning gives each dataset state an immutable identifier you can recover later. Experiment tracking records each training run — its parameters, metrics, code commit, and which data version it used. You need both: versioning preserves the artifacts, while tracking captures how they were combined to produce a specific model.

How do model registries help reproducibility? A model registry stores each model version with metadata linking it back to the exact run, code, and data that produced it, plus deployment stage. That lineage lets you audit how a production model was built, reproduce or retrain it, compare versions, and roll back instantly to a known-good version if a new one regresses.

Do I really need to control random seeds? For strict reproducibility, yes — set and log seeds for your ML framework, NumPy, and Python so shuffling and initialization are repeatable. Be aware some GPU operations remain nondeterministic; frameworks offer deterministic modes (at a performance cost) if you need bit-for-bit results, otherwise small variance is normal and usually acceptable.

What is data lineage and why does it matter? Data lineage is the recorded trail of where data came from and how it was transformed into the features and datasets used for training. It matters for debugging (tracing a bad prediction to a bad input), compliance (proving what data trained a model), and reproducibility (recovering the exact upstream state). Pachyderm, lakehouse catalogs, and trackers like W&B capture lineage.

Can I get reproducibility on a cloud ML platform without separate tools? Largely yes. Platforms like Databricks (Delta time travel + MLflow), SageMaker (Feature Store time-travel + Model Registry + Pipelines), and Vertex AI (Datasets + Experiments + Model Registry) bundle data versioning, experiment tracking, and model registry so the pieces are integrated. You still need to pin environments and seeds, but the artifact-versioning backbone is built in.
