How do you version datasets and models for reproducibility?
Direct Answer
You version datasets and models for reproducibility by treating every input and output of training as an immutable, identified artifact and recording the exact links between them. In practice that means: version your code in Git; version your data with a tool like DVC, LakeFS, Pachyderm, or a Delta/Iceberg lakehouse that gives each dataset state a content hash or commit; track every training run with an experiment tracker (MLflow, Weights & Biases) that captures parameters, metrics, code commit, and data version; and register the resulting model in a model registry with its full lineage.
The goal is that for any model in production you can answer "exactly which code, which data, and which configuration produced this?" and re-run it to get the same result. Pinning dependencies and random seeds, plus capturing the environment (Docker, lockfiles), closes the last gaps so a result is truly reproducible rather than merely tracked.
Why reproducibility is hard in ML
In normal software, versioning code in Git is usually enough — same code, same output. ML breaks that assumption because a model is a function of three things that all change independently: the code, the data, and the configuration/environment (hyperparameters, library versions, hardware, random seeds).
Change any one and the model changes. Worse, datasets are often large binary files that don't belong in Git, and training has nondeterministic elements (GPU operations, shuffling) that can vary run to run.
So reproducibility requires versioning all three dimensions and recording how they combine for each run. Skipping any one leaves a gap where "it worked last month" becomes unrecoverable.
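One way to make "how they combine" concrete is to derive a single identifier from all three dimensions. The sketch below is illustrative only: `run_fingerprint` is a hypothetical helper, not part of any tool named here, and the commit and data identifiers are made-up placeholders.

```python
import hashlib
import json


def run_fingerprint(code_commit: str, data_version: str, config: dict) -> str:
    """Combine code, data, and configuration into one stable identifier."""
    payload = json.dumps(
        {"code": code_commit, "data": data_version, "config": config},
        sort_keys=True,  # key order must not affect the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder identifiers, for illustration only.
base = run_fingerprint("a1b2c3d", "sha256:9f8e", {"lr": 0.01, "seed": 42})
same = run_fingerprint("a1b2c3d", "sha256:9f8e", {"seed": 42, "lr": 0.01})
new_data = run_fingerprint("a1b2c3d", "sha256:0000", {"lr": 0.01, "seed": 42})

print(base == same)      # same inputs, different key order
print(base == new_data)  # only the data version changed
```

If any one of the three inputs changes, the fingerprint changes. That is the property a run record needs in order to tell two training runs apart.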
Versioning datasets
Datasets are usually too large for Git, so you version them with purpose-built tools that store the heavy bytes in object storage (S3, GCS) while keeping lightweight, versioned pointers next to your code:
- DVC (Data Version Control) is a widely used open-source approach. It works alongside Git: you `dvc add` a dataset, DVC stores its content hash in a small `.dvc` file you commit to Git, and the actual data lives in remote storage. Checking out an old Git commit and running `dvc checkout` restores the exact dataset that went with it.
- LakeFS brings Git-like branching, commits, and merges to object storage, so you can snapshot a whole data lake atomically and reproduce the exact state used by any run.
- Pachyderm versions data and provides data-driven pipelines with lineage, automatically re-running stages when inputs change.
- Delta Lake and Apache Iceberg are lakehouse table formats with time travel — every write creates a new immutable version, so you can query a table "as of" a specific version or timestamp, which is a clean way to pin training data.
- Hugging Face Datasets versions public and private datasets with commit hashes for sharing and reuse.
The principle is the same across tools: each dataset state has an immutable identifier (content hash, commit, or version number) that you record with the run, so the exact bytes can be recovered later.
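The content-hash idea can be sketched with the standard library alone. This is a minimal illustration, not how any particular tool works internally: real tools choose their own hash algorithm and pointer-file format, and `content_hash` and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Chunked reads keep memory use flat for large dataset files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"  # hypothetical dataset file

    data_file.write_bytes(b"id,label\n1,0\n2,1\n")
    version_a = content_hash(data_file)

    # Changing a single label produces a different identifier.
    data_file.write_bytes(b"id,label\n1,0\n2,0\n")
    version_b = content_hash(data_file)

print(version_a != version_b)
```

Because the identifier is derived from the bytes themselves, recording it with a run pins the exact dataset contents rather than a file name that could later point at something else.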

Tracking experiments and runs
Versioning artifacts isn't enough; you must capture the *act of training* that connects them. Experiment trackers do this:
- MLflow Tracking logs parameters, metrics, the code version, the dataset version/reference, and output artifacts for every run, and pairs with the MLflow Model Registry for stage promotion and lineage.
- Weights & Biases (W&B) logs runs, metrics, and config, and its Artifacts feature versions datasets and models with a lineage graph linking them to the runs that used and produced them.
- Neptune and Comet offer similar run-and-artifact tracking with rich comparison UIs.
A good tracker turns reproducibility into a lookup: open the run, see the exact code commit, data version, hyperparameters, environment, and metrics — everything needed to reconstruct it.
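To show the shape of what a tracker stores per run, here is a standard-library stand-in. It is a sketch of the idea, not the API of MLflow, W&B, or any other tracker; `RunRecord`, `current_git_commit`, the run ID, the data version, and the metric value are all hypothetical placeholders.

```python
import json
import subprocess
import sys
from dataclasses import asdict, dataclass, field


def current_git_commit() -> str:
    """Return the current commit hash, or "unknown" outside a Git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


@dataclass
class RunRecord:
    run_id: str
    code_commit: str
    data_version: str
    params: dict
    metrics: dict = field(default_factory=dict)
    python_version: str = sys.version.split()[0]


record = RunRecord(
    run_id="run-001",              # placeholder run ID
    code_commit=current_git_commit(),
    data_version="sha256:9f8e",    # placeholder dataset identifier
    params={"lr": 0.01, "seed": 42},
)
record.metrics["val_accuracy"] = 0.91  # placeholder value, not a real result

# One JSON document per run is the "lookup" a tracker provides.
serialized = json.dumps(asdict(record), sort_keys=True, indent=2)
print(serialized)
```

A real tracker adds storage, a UI, and artifact handling on top, but the core record is the same: identifiers for code and data, plus parameters, metrics, and environment details.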
Versioning models and their lineage
A trained model is an artifact too, and it needs its own versioned home with provenance. A model registry stores each model version alongside the metadata that produced it:
- MLflow Model Registry versions models, tracks stages (Staging, Production, Archived), and links each version back to the run, code, and data that created it.
- W&B Model Registry, SageMaker Model Registry, and Vertex AI Model Registry provide cloud-native equivalents with approval workflows and deployment integration.
The non-negotiable is lineage: from a deployed model, you should be able to trace back to its training run, its dataset version, and its code commit. That chain is what lets you audit, debug, reproduce, or roll back a model with confidence.
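The lineage chain can be pictured as a pair of lookups. The in-memory sketch below assumes a hypothetical registry layout (`ModelVersion`, `Run`, `trace_lineage`, and every identifier in it are invented for illustration); actual registries expose this through their own APIs and UIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    run_id: str
    code_commit: str
    data_version: str


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    run_id: str  # link back to the training run that produced the model


def trace_lineage(model: ModelVersion, runs: dict[str, Run]) -> dict[str, str]:
    """Follow a model version back to its run, data version, and code commit."""
    run = runs[model.run_id]  # a KeyError here means the chain is broken
    return {
        "model": f"{model.name}:v{model.version}",
        "run_id": run.run_id,
        "code_commit": run.code_commit,
        "data_version": run.data_version,
    }


# Placeholder records, for illustration only.
runs = {"run-001": Run("run-001", "a1b2c3d", "sha256:9f8e")}
deployed = ModelVersion(name="churn-classifier", version=3, run_id="run-001")

lineage = trace_lineage(deployed, runs)
print(lineage)
```

The design point is that the model version stores only a run ID, and the run stores the code and data identifiers, so each fact is recorded once and the chain is followed on demand.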
Pinning the environment and randomness
Even with code, data, and config versioned, two more things can break reproducibility:
- Dependencies. Library versions affect results, so pin them with a lockfile (`requirements.txt` with hashes, `poetry.lock`, a `conda` environment file, or `uv`) and, ideally, capture the whole environment in a Docker image tagged per run. Reproducing then means pulling the exact image.
- Randomness. Set and log random seeds for your framework, NumPy, and Python, and enable deterministic modes where available. Note that some GPU operations are inherently nondeterministic: you either accept tiny run-to-run variance or, for strict reproducibility, enable the framework's deterministic flags at a speed cost.
Capturing these alongside the run closes the last gaps between "tracked" and "truly reproducible."
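A seeding helper might look like the following sketch. It assumes NumPy and PyTorch are optional and seeds them only if they are installed; `seed_everything` is a hypothetical name, and other frameworks have their own seeding and determinism calls.

```python
import random


def seed_everything(seed: int) -> dict:
    """Seed the random number generators that are available and report which."""
    random.seed(seed)
    seeded = ["random"]

    try:
        import numpy as np
        np.random.seed(seed)  # seeds NumPy's legacy global generator
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
        # Ask PyTorch to raise an error on known nondeterministic ops.
        # This can slow training and may need extra CUDA configuration.
        torch.use_deterministic_algorithms(True)
        seeded.append("torch")
    except ImportError:
        pass

    # Return the details so they can be logged with the run.
    return {"seed": seed, "seeded": seeded}


info = seed_everything(42)
first = [random.random() for _ in range(3)]

seed_everything(42)
second = [random.random() for _ in range(3)]

print(first == second)
```

Returning the seed and the list of seeded libraries makes it easy to store them with the run record, so the seed is logged as well as set.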
Putting it together: a reproducible workflow
A practical, end-to-end setup looks like this: store code in Git; version datasets with DVC (or Delta/Iceberg time travel for warehouse data); wrap training in a pipeline (e.g., orchestrated by Dagster or Airflow) that pins the Docker environment and seeds; log every run to MLflow or W&B, capturing parameters, metrics, the Git commit, and the data version; and register the output in a model registry with full lineage.
With that in place, reproducing any model is mechanical: check out its commit, restore its data version, rebuild its environment, and re-run. With seeds and deterministic settings applied, you should land in the same place, up to any residual GPU nondeterminism.
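Before re-running, it can help to confirm that the restored state matches what was recorded. The check below is a hypothetical sketch: the field names (`code_commit`, `data_version`, `image_digest`, `seed`) and all values are assumptions for illustration, not a schema used by any specific tool.

```python
def lineage_mismatches(recorded: dict, current: dict) -> list[str]:
    """List the tracked fields whose current value differs from the record."""
    keys = ("code_commit", "data_version", "image_digest", "seed")
    return [k for k in keys if recorded.get(k) != current.get(k)]


# Placeholder values, for illustration only.
recorded = {
    "code_commit": "a1b2c3d",
    "data_version": "sha256:9f8e",
    "image_digest": "sha256:77aa",
    "seed": 42,
}
restored = dict(recorded)                              # everything restored
drifted = {**recorded, "data_version": "sha256:0000"}  # wrong dataset state

print(lineage_mismatches(recorded, restored))
print(lineage_mismatches(recorded, drifted))
```

An empty list means the commit, data, environment, and seed all match the original run, so any difference in the result comes from the training process itself rather than from its inputs.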
This discipline pays off well beyond reproducibility: it enables audits and compliance, fast rollback when a new model misbehaves, fair comparison between experiments, and confident collaboration across a team.
Frequently Asked Questions
Why can't I just store datasets in Git? Git is built for text and small files; large binary datasets bloat the repo and make it slow or unusable. Tools like DVC and LakeFS keep the heavy data in object storage while committing only a small, versioned pointer (a content hash) to Git, giving you Git-like history without the bloat.
Git LFS exists but scales poorly for large ML datasets.
What's the difference between experiment tracking and data versioning? Data versioning gives each dataset state an immutable identifier you can recover later. Experiment tracking records each training run — its parameters, metrics, code commit, and which data version it used. You need both: versioning preserves the artifacts, while tracking captures how they were combined to produce a specific model.
How do model registries help reproducibility? A model registry stores each model version with metadata linking it back to the exact run, code, and data that produced it, plus deployment stage. That lineage lets you audit how a production model was built, reproduce or retrain it, compare versions, and roll back instantly to a known-good version if a new one regresses.
Do I really need to control random seeds? For strict reproducibility, yes — set and log seeds for your ML framework, NumPy, and Python so shuffling and initialization are repeatable. Be aware some GPU operations remain nondeterministic; frameworks offer deterministic modes (at a performance cost) if you need bit-for-bit results, otherwise small variance is normal and usually acceptable.
What is data lineage and why does it matter? Data lineage is the recorded trail of where data came from and how it was transformed into the features and datasets used for training. It matters for debugging (tracing a bad prediction to a bad input), compliance (proving what data trained a model), and reproducibility (recovering the exact upstream state).
Pachyderm, lakehouse catalogs, and trackers like W&B capture lineage.
Can I get reproducibility on a cloud ML platform without separate tools? Largely yes. Platforms like Databricks (Delta time travel + MLflow), SageMaker (Feature Store time-travel + Model Registry + Pipelines), and Vertex AI (Datasets + Experiments + Model Registry) bundle data versioning, experiment tracking, and model registry so the pieces are integrated.
You still need to pin environments and seeds, but the artifact-versioning backbone is built in.
Sources
- DVC documentation — data and model versioning (dvc.org/doc)
- LakeFS documentation — Git-like versioning over object storage (docs.lakefs.io)
- Pachyderm documentation — data versioning and lineage (docs.pachyderm.com)
- Delta Lake and Apache Iceberg time-travel documentation (delta.io, iceberg.apache.org)
- MLflow documentation — Tracking and Model Registry (mlflow.org/docs)
- Weights & Biases documentation — Artifacts and Model Registry (docs.wandb.ai)
- Amazon SageMaker and Google Vertex AI model registry documentation (docs.aws.amazon.com, cloud.google.com/vertex-ai/docs)
- Hugging Face Datasets documentation — dataset versioning (huggingface.co/docs/datasets)
