How do you version datasets and models for reproducibility?

Question

Pulse RevOps · The Machine · Accepted Answer

![versioning datasets and models for reproducibility](https://image.pollinations.ai/prompt/dataset%20and%20model%20versioning%20reproducibility%20git%20lineage%20experiment%20tracking%20machine%20learning%20pipeline%20glowing%20orange%20diagram?width=1280&height=720&nologo=true)

# How do you version datasets and models for reproducibility?

### Direct Answer
You version datasets and models for reproducibility by treating **every input and output of training as an immutable, identified artifact** and recording the exact links between them. In practice that means: version your **code** in Git; version your **data** with a tool like DVC, LakeFS, Pachyderm, or a Delta/Iceberg lakehouse that gives each dataset state a content hash or commit; track every **training run** with an experiment tracker (MLflow, Weights & Biases) that captures parameters, metrics, code commit, and data version; and register the resulting **model** in a model registry with its full lineage. The goal is that for any model in production you can answer "exactly which code, which data, and which configuration produced this?" and re-run it to get the same result. Pinning dependencies and random seeds, plus capturing the environment (Docker, lockfiles), closes the last gaps so a result is truly reproducible rather than merely tracked.

## Why reproducibility is hard in ML

In normal software, versioning code in Git is usually enough — same code, same output. ML breaks that assumption because a model is a function of **three** things that all change independently: the **code**, the **data**, and the **configuration/environment** (hyperparameters, library versions, hardware, random seeds). Change any one and the model changes. Worse, datasets are often large binary files that don't belong in Git, and training has nondeterministic elements (GPU operations, shuffling) that can vary run to run.

So reproducibility requires versioning all three dimensions and recording how they combine for each run. Skipping any one leaves a gap where "it worked last month" becomes unrecoverable.

```mermaid
flowchart LR
    Code[Code - Git commit] --> Run[Training run]
    Data[Data - DVC/LakeFS version] --> Run
    Config[Config + env - params, seeds, Docker] --> Run
    Run --> Track[Experiment tracker logs all three]
    Track --> Model[Model artifact + version]
    Model --> Reg[Model registry with lineage]
```
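The three inputs in the diagram can be collapsed into a single identifier per run. The following is a minimal sketch using only the Python standard library; `run_fingerprint` and the sample commit, data version, and config values are hypothetical names chosen for illustration, not part of any tool mentioned here.

```python
import hashlib
import json


def run_fingerprint(code_commit: str, data_version: str, config: dict) -> str:
    """Return one stable ID for a (code, data, config) combination."""
    payload = json.dumps(
        {"code": code_commit, "data": data_version, "config": config},
        sort_keys=True,          # key order must not change the fingerprint
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical values for illustration only.
base = run_fingerprint("a1b2c3d", "data-v1", {"lr": 0.01, "seed": 42})
same = run_fingerprint("a1b2c3d", "data-v1", {"seed": 42, "lr": 0.01})
new_data = run_fingerprint("a1b2c3d", "data-v2", {"lr": 0.01, "seed": 42})

print(base == same)      # True: identical inputs give an identical fingerprint
print(base == new_data)  # False: changing only the data changes the ID
```

If two runs share a fingerprint but produce different models, the difference has to come from something that was not recorded, such as a library version or an unpinned seed.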

## Versioning datasets

Datasets are usually too large for Git, so you version them with purpose-built tools that store the heavy bytes in object storage (S3, GCS) while keeping lightweight, versioned pointers next to your code:

- **DVC (Data Version Control)** is a widely used open-source option. It works alongside Git: `dvc add` records the dataset's content hash in a small `.dvc` file that you commit to Git, and `dvc push` uploads the actual data to remote storage. Checking out an old Git commit and running `dvc checkout` (after `dvc pull` if the files are not already in the local cache) restores the exact dataset that went with it.
- **LakeFS** brings **Git-like branching, commits, and merges to object storage**, so you can snapshot a whole data lake atomically and reproduce the exact state used by any run.
- **Pachyderm** versions data and provides data-driven pipelines with lineage, automatically re-running stages when inputs change.
- **Delta Lake** and **Apache Iceberg** are lakehouse table formats with **time travel** — every write creates a new immutable version, so you can query a table "as of" a specific version or timestamp, which is a clean way to pin training data.
- **Hugging Face Datasets** hosts public and private datasets in Git-backed repositories on the Hugging Face Hub, so a dataset can be pinned to a specific commit hash, branch, or tag for sharing and reuse.

The principle is the same across tools: each dataset state has an **immutable identifier** (content hash, commit, or version number) that you record with the run, so the exact bytes can be recovered later.
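To make the idea of an immutable identifier concrete, here is a minimal standard-library sketch of a content-hash pointer file. It is an illustration of the principle only, not the on-disk format of DVC or any other tool; the `.pointer.json` suffix, the function names, and the tiny `train.csv` file are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer next to the data; the pointer is what goes in Git."""
    pointer = {
        "path": data_path.name,
        "sha256": content_hash(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.parent / (data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2, sort_keys=True))
    return pointer_path


def verify(pointer_path: Path) -> bool:
    """Check that the data on disk still matches the recorded hash."""
    pointer = json.loads(pointer_path.read_text())
    data_path = pointer_path.parent / pointer["path"]
    return content_hash(data_path) == pointer["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"             # hypothetical dataset file
    data.write_text("id,label\n1,0\n2,1\n")
    ptr = write_pointer(data)
    ok_before = verify(ptr)                    # data unchanged since the pointer was written
    data.write_text("id,label\n1,0\n2,0\n")    # one label silently edited
    ok_after = verify(ptr)

print(ok_before, ok_after)  # True False
```

Because the identifier is derived from the bytes themselves, any silent edit to the dataset is detectable, which a filename or a "latest" folder cannot guarantee.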

## Tracking experiments and runs

Versioning artifacts isn't enough; you must capture the *act of training* that connects them. **Experiment trackers** do this:

- **MLflow Tracking** logs parameters, metrics, the code version, the dataset version/reference, and output artifacts for every run, and pairs with the **MLflow Model Registry** for stage promotion and lineage.
- **Weights & Biases (W&B)** logs runs, metrics, and config, and its **Artifacts** feature versions datasets and models with a lineage graph linking them to the runs that used and produced them.
- **Neptune** and **Comet** offer similar run-and-artifact tracking.
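What a tracker records per run can be sketched without any of these tools. The example below is a toy, standard-library stand-in that appends one JSON record per run to a file; it is not the API of MLflow, W&B, Neptune, or Comet, and `log_run`, `find_run`, the commit and data version strings, and the metric values are all hypothetical.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def log_run(log_path: Path, *, code_commit: str, data_version: str,
            params: dict, metrics: dict) -> str:
    """Append one training-run record and return its run ID."""
    run_id = uuid.uuid4().hex
    record = {
        "run_id": run_id,
        "logged_at": time.time(),
        "code_commit": code_commit,      # which code
        "data_version": data_version,    # which data
        "params": params,                # which configuration
        "metrics": metrics,              # what came out
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return run_id


def find_run(log_path: Path, run_id: str) -> dict:
    """Look up the inputs that produced a given run."""
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["run_id"] == run_id:
                return record
    raise KeyError(run_id)


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "runs.jsonl"
    # Made-up values for illustration only.
    first = log_run(log, code_commit="a1b2c3d", data_version="data-v1",
                    params={"lr": 0.01, "seed": 42}, metrics={"val_acc": 0.91})
    log_run(log, code_commit="a1b2c3d", data_version="data-v2",
            params={"lr": 0.01, "seed": 42}, metrics={"val_acc": 0.93})
    found = find_run(log, first)

print(found["data_version"], found["params"]["seed"])  # data-v1 42
```

A real tracker adds artifact storage, a UI, and concurrency handling on top, but the core record is the same: a run ID tied to the code commit, data version, parameters, and metrics.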
