DVC: version data & models

Git is brilliant for code and terrible for data. Commit a 2 GB dataset or a 500 MB model and your repo becomes slow to clone and painful to work with, because Git keeps every version of every file in its history. Yet you need to know which data and which model produced a given result (it's required for reproducibility, debugging, and audits). DVC (Data Version Control) solves this: it versions large files with Git, without putting them in Git.

Setup: pip install dvc (install "dvc[s3]" or "dvc[gs]" instead for cloud remotes). Follow-along: DVC commands are shown with their expected output.

The core trick

DVC keeps your big files in separate storage (a "remote": S3, GCS, even a shared folder) and commits a tiny pointer file to Git instead. The pointer holds a hash of the data; Git versions the pointer, DVC versions the data:

Git repo:   data.csv.dvc   (a 100-byte text file with a hash)   ✅ tiny, versioned
DVC remote: the actual 2 GB data.csv, keyed by that hash        ✅ big, out of Git

So git checkout an old commit, then dvc checkout, and you get exactly the data and model that commit used — full reproducibility, no bloated repo.
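
To make the pointer idea concrete, here is a minimal Python sketch of content-addressed storage. It is an illustration only, not DVC's implementation: the track function, the cache directory, and the .pointer suffix are hypothetical names, and DVC's real cache layout and .dvc file format differ in detail. It does use MD5, which is DVC's default hash.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large file never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(path: Path, cache_dir: Path) -> Path:
    """Copy `path` into a hash-keyed cache and write a small pointer next to it."""
    content_hash = file_md5(path)
    # Store the object under its hash, so identical content is stored only once.
    cached = cache_dir / content_hash[:2] / content_hash[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        shutil.copy2(path, cached)
    # The pointer is the only file that would be committed to Git.
    pointer = path.with_name(path.name + ".pointer")  # hypothetical suffix
    pointer.write_text(f"md5: {content_hash}\npath: {path.name}\n")
    return pointer


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.csv"
        data.write_bytes(b"id,text\n" + b"1,great\n" * 100_000)
        pointer = track(data, Path(tmp) / "cache")
        print(f"data: {data.stat().st_size} bytes, pointer: {pointer.stat().st_size} bytes")
```

The pointer stays a few dozen bytes no matter how large the data file grows, which is why it is cheap to commit.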

Versioning a dataset

cd code
dvc init                          # set up DVC in the repo (once)
dvc add data/reviews.csv          # tell DVC to track this file
git add data/reviews.csv.dvc data/.gitignore
git commit -m "track reviews dataset v1"

Expected output of dvc add:

100% Adding...|████████████████████████████████|1/1 [00:00, 12.4file/s]

To track the changes with git, run:
    git add data/reviews.csv.dvc data/.gitignore

DVC moved the real CSV into its cache, linked (or copied) it back to data/reviews.csv so you can keep working with it, wrote reviews.csv.dvc (the pointer), and added the CSV to .gitignore so Git never tries to track the big file. You commit the pointer, not the data.

Don't be confused: dvc add vs. git add. git add stages a file into Git. dvc add does something different: it hands the file to DVC, adds it to .gitignore so Git leaves it alone, and writes a small .dvc pointer that Git does track. You then git add the pointer. (If Git already tracks the file, dvc add refuses; run git rm --cached on the file first.) Two tools, clear division: Git tracks small text (code, pointers), DVC tracks big binaries (data, models).

Pushing data to a remote

The data lives in your local DVC cache until you push it to shared storage:

dvc remote add -d storage s3://my-bucket/dvc-store    # configure once
dvc push                                              # upload tracked data

dvc remote add writes the remote to .dvc/config, so commit that file too. Now a teammate runs git pull (gets the pointers and the remote config) then dvc pull (downloads the matching data). Everyone has the same data, versioned in lockstep with the code.
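
Because objects are named by their hash, syncing a cache with a remote reduces to "copy whatever the other side is missing". The sketch below shows that idea with two local folders. The sync_objects function is a hypothetical name, not part of DVC, and real dvc push and dvc pull do more (they only transfer objects referenced by your pointers, and dvc pull also checks the files out into the workspace).

```python
import shutil
import tempfile
from pathlib import Path


def sync_objects(source: Path, destination: Path) -> list[str]:
    """Copy every object missing from `destination`; return the names copied."""
    copied = []
    for obj in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = obj.relative_to(source)
        target = destination / relative
        if target.exists():
            continue  # same hash-derived name means same content: nothing to transfer
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(obj, target)
        copied.append(relative.as_posix())
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        local_cache = Path(tmp) / "cache"
        remote = Path(tmp) / "remote"  # stands in for s3://my-bucket/dvc-store
        (local_cache / "ab").mkdir(parents=True)
        (local_cache / "ab" / "cdef").write_bytes(b"dataset bytes")
        print("push:", sync_objects(local_cache, remote))        # uploads one object
        print("push again:", sync_objects(local_cache, remote))  # nothing left to upload
```

Calling sync_objects(remote, local_cache) is the pull direction: a teammate with an empty cache receives exactly the objects they lack.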

Reproducible pipelines (dvc.yaml)

DVC's second superpower: define your pipeline as stages with declared dependencies and outputs, and DVC only re-runs a stage when its inputs change:

# dvc.yaml
stages:
  train:
    cmd: python sentiment/train.py
    deps:
      - sentiment/train.py
      - data/reviews.csv          # if the data changes, retrain
    outs:
      - model.json                 # DVC versions the output model too
    metrics:
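      - metrics.json               # track metrics across versions

Then run the pipeline and inspect its metrics:

dvc repro          # runs only the stages whose dependencies changed
dvc metrics show   # print the current metrics
dvc metrics diff   # compare workspace metrics against the last commit (HEAD)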

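dvc repro is "make for ML": change the data and it knows to retrain; change nothing and it does nothing. It records the hashes of each stage's dependencies and outputs in dvc.lock, so commit dvc.yaml and dvc.lock to Git along with your code. This makes your whole train→evaluate flow reproducible and cached.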
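
The "re-run only when inputs change" rule can be sketched in a few lines of Python: hash the dependencies, compare against the hashes recorded on the last run, and skip the command when they match. This is a simplified illustration under stated assumptions, not DVC's code. The repro and fingerprint functions and the JSON lock file are hypothetical, and real DVC also accounts for the stage command and its outputs.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def fingerprint(deps: list[Path]) -> dict[str, str]:
    """Map each dependency path to the MD5 hash of its current content."""
    return {str(p): hashlib.md5(p.read_bytes()).hexdigest() for p in deps}


def repro(deps: list[Path], run: Callable[[], None], lock_file: Path) -> bool:
    """Run the stage only if a dependency changed since the recorded run."""
    current = fingerprint(deps)
    recorded = json.loads(lock_file.read_text()) if lock_file.exists() else None
    if recorded == current:
        return False  # inputs unchanged: skip the stage
    run()
    lock_file.write_text(json.dumps(current, indent=2))  # remember what we ran on
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "reviews.csv"
        lock = Path(tmp) / "stage.lock.json"
        data.write_text("id,text\n1,great\n")

        def train() -> None:
            print("training...")

        print(repro([data], train, lock))  # first run: trains
        print(repro([data], train, lock))  # unchanged: skipped
        data.write_text("id,text\n1,great\n2,awful\n")
        print(repro([data], train, lock))  # data changed: trains again
```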

Time-travel to any experiment

Because data, model, and metrics are all versioned alongside the code, you can jump to any past state completely:

git checkout <old-commit>     # the code + the .dvc pointers from back then
dvc checkout                  # the exact data + model that commit used

You're now looking at precisely the data and model that produced last month's result. That's impossible with Git alone, and essential when a regulator or a postmortem asks "what exactly did we train on?" If that old version is no longer in your local cache, run dvc pull to fetch it from the remote.
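
Why this works: each commit records a hash, and the cache keeps every version under its own hash, so old data is never overwritten. A minimal Python sketch of that two-step lookup follows. The store and checkout functions, the commit names, and the flat cache folder are all hypothetical stand-ins, not DVC's API.

```python
import hashlib
import tempfile
from pathlib import Path


def store(content: bytes, cache_dir: Path) -> str:
    """Save content in the cache under its MD5 hash and return that hash."""
    key = hashlib.md5(content).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / key).write_bytes(content)
    return key


def checkout(commit: str, pointers: dict[str, str], cache_dir: Path, target: Path) -> None:
    """Rewrite `target` with the content whose hash the given commit recorded."""
    key = pointers[commit]  # the part Git restores: the pointer for that commit
    target.write_bytes((cache_dir / key).read_bytes())  # the part DVC restores: the data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "cache"
        data = Path(tmp) / "reviews.csv"
        # Hypothetical history: each commit name maps to the hash in its pointer file.
        pointers = {
            "commit-a": store(b"id,text\n1,great\n", cache),
            "commit-b": store(b"id,text\n1,great\n2,awful\n", cache),
        }
        checkout("commit-a", pointers, cache, data)
        print(data.read_text())
```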

DVC vs. the alternatives

  • Git LFS — also stores big files outside Git, but it's generic and lacks DVC's ML pipelines, metrics, and experiment features.
  • lakeFS / Delta Lake — data versioning at the data-lake scale (whole tables, branching petabytes). Bigger hammer.
  • MLflow — versions models and runs; DVC versions data and pipelines. They're complementary, and teams often use both.

Don't be confused: DVC vs. MLflow. They overlap but lead with different things. MLflow centers on experiment tracking and the model registry (params, metrics, model versions). DVC centers on data versioning and reproducible pipelines (which data + which code = which result). Use MLflow to compare runs and ship models; use DVC to guarantee you can reproduce the data behind any of them.

The takeaway

DVC versions large data and models with Git but not in it: the big files live in a remote (S3/GCS), and Git tracks tiny .dvc pointer files, so git checkout + dvc checkout reproduces any past data-and-model state exactly. dvc.yaml turns your pipeline into cached, dependency-aware stages ("make for ML"). Now that data and models are versioned, let's automate the loop that produces them — orchestrated retraining. 👉