DVC: version data & models
Git is brilliant for code and terrible for data. Commit a 2 GB dataset or a 500 MB model and your repo becomes unusable: Git keeps every committed version of a file in its history, and a normal clone downloads all of it. Yet you need to know which data and which model produced a given result; reproducibility, debugging, and audits all depend on it. DVC (Data Version Control) solves this: it versions large files alongside Git, without putting them in Git.
Setup:
pip install dvc (add dvc[s3] / dvc[gs] for cloud remotes). Follow-along: DVC commands shown with expected output.
The core trick
DVC keeps your big files in separate storage (a "remote": S3, GCS, even a shared folder) and commits a tiny pointer file to Git instead. The pointer holds a hash of the data; Git versions the pointer, DVC versions the data:
Git repo: data.csv.dvc (a 100-byte text file with a hash) ✅ tiny, versioned
DVC remote: the actual 2 GB data.csv, keyed by that hash ✅ big, out of Git
So git checkout an old commit, then dvc checkout, and you get exactly the data
and model that commit used — full reproducibility, no bloated repo.
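To make the pointer idea concrete, here is a toy sketch in Python of content-addressed storage. It is not DVC's implementation: the function names (toy_add, toy_checkout), the .ptr.json pointer format, and the flat cache layout are all made up for illustration. What it shares with DVC is the idea of keying the stored file by an MD5 hash of its content.

```python
import hashlib
import json
import shutil
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash the file in chunks so a large file never has to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def toy_add(data_file: Path, cache_dir: Path) -> Path:
    """Copy the file into a hash-keyed cache and write a small pointer beside it."""
    md5 = file_md5(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / md5
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    # Hypothetical pointer format; real .dvc files are YAML with more fields.
    pointer = data_file.parent / (data_file.name + ".ptr.json")
    meta = {"md5": md5, "size": data_file.stat().st_size, "path": data_file.name}
    pointer.write_text(json.dumps(meta))
    return pointer


def toy_checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the file content that the pointer refers to."""
    meta = json.loads(pointer.read_text())
    target = pointer.parent / meta["path"]
    shutil.copy2(cache_dir / meta["md5"], target)
    return target
```

The pointer is the only thing that would go into Git. Put an old pointer back (which is what git checkout does) and toy_checkout restores the matching bytes from the cache, which is the same division of labor as git checkout followed by dvc checkout.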
Versioning a dataset
cd code
dvc init # set up DVC in the repo (once)
dvc add data/reviews.csv # tell DVC to track this file
git add data/reviews.csv.dvc data/.gitignore
git commit -m "track reviews dataset v1"
Expected output of dvc add:
100% Adding...|████████████████████████████████|1/1 [00:00, 12.4file/s]
To track the changes with git, run:
git add data/reviews.csv.dvc data/.gitignore
DVC stored the real CSV in its cache (a working copy stays at data/reviews.csv), wrote
reviews.csv.dvc (the pointer), and added the CSV to data/.gitignore so Git never tries
to track the big file. You commit the pointer, not the data.
Don't be confused:
dvc add vs. git add. git add stages a file into Git. dvc add does the opposite: it pulls a file out of Git's tracking and into DVC's, leaving behind a small .dvc pointer that Git does track. You then git add the pointer. Two tools, clear division: Git tracks small text (code, pointers), DVC tracks big binaries (data, models).
Pushing data to a remote
The data lives in your local DVC cache until you push it to shared storage:
dvc remote add -d storage s3://my-bucket/dvc-store # configure once; commit .dvc/config to share it
dvc push # upload tracked data
Now a teammate runs git pull (gets the pointers) then dvc pull (downloads the
matching data). Everyone has the same data, versioned in lockstep with the code.
Reproducible pipelines (dvc.yaml)
DVC's second superpower: define your pipeline as stages with declared dependencies and outputs, and DVC only re-runs a stage when its inputs change:
# dvc.yaml
stages:
  train:
    cmd: python sentiment/train.py
    deps:
      - sentiment/train.py
      - data/reviews.csv       # if the data changes, retrain
    outs:
      - model.json             # DVC versions the output model too
    metrics:
      - metrics.json           # track metrics across versions
dvc repro # runs only the stages whose dependencies changed
dvc metrics show # print the current metrics (dvc metrics diff compares them against another commit)
dvc repro is "make for ML": change the data and it knows to retrain; change nothing
and it does nothing. This makes your whole train→evaluate flow reproducible and
cached.
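How does a tool know that "nothing changed"? One common approach, sketched below in Python, is to hash every dependency and compare against the hashes recorded after the last run. DVC records this kind of information in dvc.lock; the sketch is only an illustration of the idea, and toy_repro, deps_fingerprint, and the JSON lock format are hypothetical. Real DVC also considers the stage command and its outputs, which this toy ignores.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def deps_fingerprint(deps: list[Path]) -> dict[str, str]:
    """Map each dependency path to the MD5 of its current content."""
    return {str(p): hashlib.md5(p.read_bytes()).hexdigest() for p in deps}


def toy_repro(deps: list[Path], lock_file: Path, run_stage: Callable[[], None]) -> bool:
    """Run the stage only if a dependency changed since the last recorded run.

    Returns True when the stage ran and False when it was skipped.
    """
    current = deps_fingerprint(deps)
    previous = json.loads(lock_file.read_text()) if lock_file.exists() else None
    if current == previous:
        return False  # same inputs as last time: skip the stage
    run_stage()
    lock_file.write_text(json.dumps(current))  # record what this run used
    return True
```

Call it twice in a row and the second call skips the stage; edit the dataset and the next call runs it again. The lock file is small text, so it belongs in Git next to the code.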
Time-travel to any experiment
Because data, model, and metrics are all versioned alongside the code, you can jump to any past state completely:
git checkout <old-commit> # the code + the .dvc pointers from back then
dvc checkout # the exact data + model that commit used
You're now looking at precisely the data and model that produced last month's result — the thing that's impossible with Git alone, and essential when a regulator or a postmortem asks "what exactly did we train on?"
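If you restore old states often, the two commands can be wrapped in a small helper. The Python sketch below is one possible shape, not an official DVC tool: restore_commands and restore_state are hypothetical names, and the helper defaults to a dry run that only prints what it would execute.

```python
import subprocess


def restore_commands(rev: str) -> list[list[str]]:
    """Build the two commands that bring code, pointers, and data back to `rev`."""
    return [
        ["git", "checkout", rev],  # code + .dvc pointers from that commit
        ["dvc", "checkout"],       # data + model files matching those pointers
    ]


def restore_state(rev: str, dry_run: bool = True) -> list[list[str]]:
    """Print the commands, and run them unless dry_run is set."""
    commands = restore_commands(rev)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return commands
```

Calling restore_state("<old-commit>", dry_run=False) runs both steps in order. If the old data is no longer in your local cache, dvc checkout cannot restore it; run dvc pull after the git checkout to fetch it from the remote.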
DVC vs. the alternatives
- Git LFS — also stores big files outside Git, but it's generic and lacks DVC's ML pipelines, metrics, and experiment features.
- lakeFS / Delta Lake — data versioning at the data-lake scale (whole tables, branching petabytes). Bigger hammer.
- MLflow — versions models and runs; DVC versions data and pipelines. They're complementary, and teams often use both.
Don't be confused: DVC vs. MLflow. They overlap but lead with different things. MLflow centers on experiment tracking and the model registry (params, metrics, model versions). DVC centers on data versioning and reproducible pipelines (which data + which code = which result). Use MLflow to compare runs and ship models; use DVC to guarantee you can reproduce the data behind any of them.
The takeaway
DVC versions large data and models with Git but not in it: the big files live in a
remote (S3/GCS), and Git tracks tiny .dvc pointer files, so git checkout + dvc checkout reproduces any past data-and-model state exactly. dvc.yaml turns your
pipeline into cached, dependency-aware stages ("make for ML"). Now that data and models
are versioned, let's automate the loop that produces them — orchestrated retraining. 👉