MLflow: experiment tracking
The moment you train a model more than once — a different learning rate, more
epochs, new features — you face the question that haunts every ML project: which run
was best, and what settings produced it? Without tooling the answer lives in
scrollback, sticky notes, and filenames like model_final_v3_REALLY_final.json.
MLflow replaces that chaos with a record of every run's parameters, metrics, and
artifacts.
Install:
pip install mlflow
That's all this chapter needs.
What MLflow Tracking gives you
- Parameters — the inputs you chose (learning rate, epochs, vocab size).
- Metrics — the numbers you got (train/test accuracy, loss).
- Artifacts — the files produced (the model, plots, configs).
- A UI — a web dashboard to sort, filter, and compare runs side by side.
Together these make experiments reproducible ("run 17 used lr=0.5, 500 epochs") and comparable ("lr=0.5 beat lr=0.1 by 4 points").
The five lines that matter
You wrap a training run in start_run() and log what you care about. From
train.py:
import mlflow
mlflow.set_experiment("sentiment")  # a named bucket of runs
with mlflow.start_run() as run:  # one run = one training execution
    mlflow.log_params({"lr": lr, "epochs": epochs, "vocab_size": len(model.vocab)})
    mlflow.log_metrics({"train_accuracy": train_acc, "test_accuracy": test_acc})
    mlflow.log_artifact("model.json")  # save the model file with the run
Those three calls (log_params, log_metrics, log_artifact) are the core of the
tracking API. Everything else is convenience on top.
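Metrics differ from parameters in one useful way: the same metric name can be logged repeatedly, each time with a step number, and MLflow keeps the whole history so the UI can draw a curve. Here is a minimal sketch of per-epoch logging. It assumes a training loop that produces one loss value per epoch; fake_losses and train_and_log are hypothetical names, and the loss values are made-up stand-ins, not output from train.py. The call mlflow.log_metric (singular) with a step argument is part of the real MLflow API.

```python
def fake_losses(epochs, start=1.0, decay=0.5):
    """Hypothetical stand-in for a training loop: yields (epoch, loss) pairs.

    The numbers are synthetic; a real loop would compute the loss from data.
    """
    loss = start
    for epoch in range(epochs):
        yield epoch, loss
        loss *= decay


def train_and_log(epochs, lr):
    """Log run-level params once and one 'loss' point per epoch."""
    # Imported here so fake_losses() stays usable without MLflow installed.
    import mlflow

    mlflow.set_experiment("sentiment")
    with mlflow.start_run() as run:
        mlflow.log_params({"lr": lr, "epochs": epochs})
        for epoch, loss in fake_losses(epochs):
            # Same metric name, increasing step: MLflow stores every point.
            mlflow.log_metric("loss", loss, step=epoch)
    return run.info.run_id
```

Calling train_and_log(epochs=5, lr=0.5) would create one run whose loss metric has five points, one per step.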
Running it
Our train.py logs only when you pass --mlflow, so the core stays dependency-free:
cd code/sentiment
python train.py --mlflow
Output:
2026/06/28 04:09:24 INFO mlflow.tracking.fluent: Experiment with name 'sentiment' does not exist. Creating a new experiment.
examples: 50 train=38 test=12
vocab size: 68
train_accuracy: 1.000
test_accuracy: 1.000
saved model -> model.json
logged run 17dfd654a50e4a6d8a6f0271318d572e to MLflow
MLflow created a sentiment experiment and recorded a run with a unique ID. Every
time you run it — with different settings — you get another logged run you can
compare. Run it a few times with different flags:
python train.py --mlflow --lr 0.1 --epochs 200
python train.py --mlflow --lr 1.0 --epochs 1000
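The optional --mlflow flag described above can be sketched as follows. This is not the chapter's actual train.py: the default values for --lr and --epochs, the helper names, and the placeholder train() function are all assumptions made for illustration. The point is the shape: MLflow is imported only inside the logging function, so the script runs without it unless the flag is passed.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Sketch of a train.py-style CLI")
    parser.add_argument("--lr", type=float, default=0.5)    # default is an assumption
    parser.add_argument("--epochs", type=int, default=500)  # default is an assumption
    parser.add_argument("--mlflow", action="store_true", help="log this run to MLflow")
    return parser.parse_args(argv)


def train(lr, epochs, model_path):
    """Placeholder for the real training code.

    Writes a stub model file and returns placeholder metrics (not real results).
    """
    with open(model_path, "w") as f:
        json.dump({"lr": lr, "epochs": epochs}, f)
    return {"train_accuracy": 0.0, "test_accuracy": 0.0}


def log_to_mlflow(params, metrics, model_path):
    # Lazy import: MLflow is required only when --mlflow is passed.
    import mlflow

    mlflow.set_experiment("sentiment")
    with mlflow.start_run() as run:
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
        mlflow.log_artifact(model_path)
    return run.info.run_id


def main(argv=None, model_path="model.json"):
    args = parse_args(argv)
    metrics = train(args.lr, args.epochs, model_path)
    print(f"saved model -> {model_path}")
    if args.mlflow:
        params = {"lr": args.lr, "epochs": args.epochs}
        run_id = log_to_mlflow(params, metrics, model_path)
        print(f"logged run {run_id} to MLflow")
    return metrics
```

In a real script, main() would be called under the usual if __name__ == "__main__": guard. Keeping the import inside log_to_mlflow is a design choice: anyone who only wants to train never needs MLflow installed.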
Viewing & comparing runs in the UI
MLflow wrote everything to a local mlruns/ folder. Launch the dashboard:
mlflow ui --backend-store-uri ./mlruns --port 5000 # open http://localhost:5000
You'll see a sortable table — one row per run, columns for every param and metric.
Select two runs and click Compare to see them side by side, sort by test_accuracy to
find the winner, and download any run's logged model.json. This table is the
"which run was best?" answer, forever.
Autolog: tracking for free
For popular frameworks, MLflow can capture everything automatically — one line
instead of manual log_* calls:
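mlflow.autolog()  # call once before training, then train as usual
# for supported libraries (scikit-learn, XGBoost, PyTorch Lightning, ...) it logs params, metrics, and the model
autolog() is the fastest way to get tracking into an existing scikit-learn or
XGBoost script: it patches the library's training entry point (such as fit()) and
records params, training metrics, and the model with no extra logging code. For
PyTorch it works through PyTorch Lightning's Trainer; a hand-written training loop
still needs manual log_* calls. Exactly what gets captured varies by library.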
Where the data lives (and how it scales)
- Local (this chapter): mlruns/ is a folder on disk, perfect for solo work.
- Team: run an MLflow tracking server backed by a database (Postgres) and object storage (S3), and point everyone at it with mlflow.set_tracking_uri("http://mlflow.mycompany.com"). Now the whole team's experiments are in one searchable place.
Don't be confused: tracking vs. the registry. Tracking (this chapter) records experiments — every run you ever did, good or bad. The registry (next chapter) manages the chosen models — versioned, named, promoted to production. Tracking is your lab notebook; the registry is your shipping shelf. They're complementary, and MLflow does both.
Why this matters in practice
- Reproducibility — months later you can see exactly what produced a model.
- Comparison — no more guessing whether the new idea actually helped.
- Collaboration — teammates see your runs instead of asking "what did you try?"
- Auditability — regulated industries require a record of how each model was built. MLflow is that record.
This is the difference between "I think the new model is better" and "run #17 improved test accuracy from 0.92 to 0.96; here's the proof and the exact config."
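The same comparison can be pulled out of the tracking store in code instead of the UI. This is a minimal sketch, assuming runs were logged to the sentiment experiment with the parameter and metric names used earlier (lr, epochs, test_accuracy). The helpers order_clause and top_runs are illustrative names, not part of MLflow. mlflow.search_runs is the real API call; by default it returns a pandas DataFrame whose columns are named params.<name> and metrics.<name>.

```python
def order_clause(metric, higher_is_better=True):
    """Build an order_by entry such as 'metrics.test_accuracy DESC'."""
    direction = "DESC" if higher_is_better else "ASC"
    return f"metrics.{metric} {direction}"


def top_runs(experiment="sentiment", metric="test_accuracy", n=5):
    """Return the n best runs of an experiment, best first."""
    # Imported here so order_clause() stays usable without MLflow installed.
    import mlflow

    runs = mlflow.search_runs(
        experiment_names=[experiment],
        order_by=[order_clause(metric)],
        max_results=n,
    )
    # Keep only the columns of interest; a column is absent if no run logged it.
    wanted = ["run_id", "params.lr", "params.epochs", f"metrics.{metric}"]
    return runs[[col for col in wanted if col in runs.columns]]
```

Calling top_runs() from the directory that holds mlruns/, or after pointing MLflow at a tracking server, returns the best runs first. Note that parameter values come back as strings, because that is how MLflow stores them.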
The takeaway
MLflow Tracking records every run's params, metrics, and artifacts with three calls
(log_params, log_metrics, log_artifact), gives you a UI to compare them, and
scales from a local folder to a shared team server. It turns "which experiment was
best?" from archaeology into a database query. Now that we can find our best model,
let's manage and ship it with the MLflow Model Registry. 👉