MLflow: experiment tracking

The moment you train a model more than once — a different learning rate, more epochs, new features — you face the question that haunts every ML project: which run was best, and what settings produced it? Without tooling the answer lives in scrollback, sticky notes, and filenames like model_final_v3_REALLY_final.json. MLflow replaces that chaos with a record of every run's parameters, metrics, and artifacts.

Install: pip install mlflow — that's all this chapter needs.

What MLflow Tracking gives you

  • Parameters — the inputs you chose (learning rate, epochs, vocab size).
  • Metrics — the numbers you got (train/test accuracy, loss).
  • Artifacts — the files produced (the model, plots, configs).
  • A UI — a web dashboard to sort, filter, and compare runs side by side.

Together these make experiments reproducible ("run 17 used lr=0.5, 500 epochs") and comparable ("lr=0.5 beat lr=0.1 by 4 points").

The five lines that matter

You wrap a training run in start_run() and log what you care about. From train.py:

import mlflow

mlflow.set_experiment("sentiment")               # a named bucket of runs
with mlflow.start_run() as run:                  # one run = one training execution
    mlflow.log_params({"lr": lr, "epochs": epochs, "vocab_size": len(model.vocab)})
    mlflow.log_metrics({"train_accuracy": train_acc, "test_accuracy": test_acc})
    mlflow.log_artifact("model.json")            # save the model file with the run

Those three calls are the core of tracking: log_params, log_metrics, log_artifact. The rest of the API (single-value log_param and log_metric, tags, autologging) is mostly convenience on top.
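The excerpt above leans on variables defined elsewhere in train.py. As a rough, self-contained sketch of the same calls, the function below logs a toy run. The name log_toy_run, the made-up loss curve, and the placeholder model.json are illustrative assumptions, not the chapter's code. It also uses log_metric with a step argument to record one value per epoch, and it defers the import so the file still loads where MLflow is not installed.

```python
import json
import tempfile
from pathlib import Path


def log_toy_run(lr: float, epochs: int) -> str:
    """Log one toy run and return its run ID (hypothetical helper)."""
    import mlflow  # deferred: only needed when tracking is requested

    mlflow.set_experiment("sentiment")
    with mlflow.start_run() as run:
        mlflow.log_params({"lr": lr, "epochs": epochs})

        # Stand-in for real training: a made-up loss that shrinks each epoch.
        loss = 1.0
        for epoch in range(epochs):
            loss *= 1.0 - min(lr, 0.9) * 0.1
            mlflow.log_metric("loss", loss, step=epoch)  # one point per epoch

        mlflow.log_metrics({"final_loss": loss})

        # Write a placeholder model file, then attach a copy of it to the run.
        with tempfile.TemporaryDirectory() as tmp:
            model_path = Path(tmp) / "model.json"
            model_path.write_text(json.dumps({"lr": lr, "epochs": epochs}))
            mlflow.log_artifact(str(model_path))

        return run.info.run_id
```

Calling log_toy_run(0.5, 20) from the project directory should add one run to the sentiment experiment, with a 20-point loss curve you can chart in the UI.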

Running it

Our train.py logs only when you pass --mlflow, so the core stays dependency-free:

cd code/sentiment
python train.py --mlflow

Output:

2026/06/28 04:09:24 INFO mlflow.tracking.fluent: Experiment with name 'sentiment' does not exist. Creating a new experiment.
examples: 50  train=38  test=12
vocab size: 68
train_accuracy: 1.000
test_accuracy:  1.000
saved model -> model.json
logged run 17dfd654a50e4a6d8a6f0271318d572e to MLflow

MLflow created a sentiment experiment and recorded a run with a unique ID. Each invocation, whatever its settings, adds another logged run you can compare. Run it a few times with different flags:

python train.py --mlflow --lr 0.1 --epochs 200
python train.py --mlflow --lr 1.0 --epochs 1000
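How might an optional --mlflow flag keep the core dependency-free? One plausible shape is sketched below. This is not the chapter's actual train.py: the helper names parse_args and maybe_log and the default values for --lr and --epochs are assumptions made for illustration. The key idea is that mlflow is imported only on the code path that needs it.

```python
import argparse


def parse_args(argv=None):
    """Parse the training flags (defaults here are assumed, not the book's)."""
    parser = argparse.ArgumentParser(description="Train the sentiment model.")
    parser.add_argument("--lr", type=float, default=0.5)
    parser.add_argument("--epochs", type=int, default=500)
    parser.add_argument("--mlflow", action="store_true", help="log this run to MLflow")
    return parser.parse_args(argv)


def maybe_log(args, metrics):
    """Log to MLflow only when --mlflow was passed; return True if logged."""
    if not args.mlflow:
        return False  # mlflow is never imported on this path

    import mlflow  # deferred so plain training needs no extra dependency

    mlflow.set_experiment("sentiment")
    with mlflow.start_run():
        mlflow.log_params({"lr": args.lr, "epochs": args.epochs})
        mlflow.log_metrics(metrics)
    return True
```

With this layout, someone who only wants to train never needs pip install mlflow; the import error can only surface when the flag is actually used.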

Viewing & comparing runs in the UI

MLflow wrote everything to a local mlruns/ folder. Launch the dashboard:

mlflow ui --backend-store-uri ./mlruns --port 5000      # open http://localhost:5000

You'll see a sortable table: one row per run, with columns for every param and metric. Select two runs and click Compare to see them side by side, sort by test_accuracy to find the winner, and download any run's logged model.json. This table is the lasting answer to "which run was best?", for as long as you keep the tracking data.
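The same "sort by test_accuracy" lookup can also be done from code with mlflow.search_runs. The sketch below assumes the runs were logged to the sentiment experiment with a test_accuracy metric, as in this chapter, and that it is run from the directory that holds mlruns/; the helper name best_run is hypothetical.

```python
def best_run(experiment_name: str = "sentiment", metric: str = "test_accuracy"):
    """Return (run_id, metric value, params) for the top run, or None if there are no runs."""
    import mlflow  # deferred so importing this module does not require MLflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[f"metrics.{metric} DESC"],  # highest value first
        max_results=1,
        output_format="list",  # Run objects instead of a pandas DataFrame
    )
    if not runs:
        return None
    top = runs[0]
    return top.info.run_id, top.data.metrics.get(metric), dict(top.data.params)
```

Note that MLflow stores params as strings, so a learning rate comes back as "0.5" rather than 0.5; convert it before reusing it.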

Autolog: tracking for free

For popular frameworks, MLflow can capture everything automatically — one line instead of manual log_* calls:

mlflow.autolog()          # then just train as usual
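# for scikit-learn, XGBoost, or PyTorch Lightning it auto-logs params, metrics, and the model

autolog() is the fastest way to get tracking into an existing scikit-learn, XGBoost, or PyTorch Lightning script: it patches the framework's training entry point (such as fit()) and records params, metrics, and the model with no extra logging code. Hand-written PyTorch training loops are not covered by autologging, so they still need manual log_* calls.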
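As a small illustration of autologging, the sketch below trains a scikit-learn classifier on synthetic data. It assumes scikit-learn is installed alongside MLflow; the helper name fit_with_autolog, the synthetic dataset, and the choice of LogisticRegression are stand-ins for whatever your script already trains.

```python
def fit_with_autolog():
    """Train a small classifier and let autolog record the run (illustrative helper)."""
    import mlflow
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    mlflow.set_experiment("sentiment")
    mlflow.autolog()  # must be enabled before fit() is called

    # Synthetic data stands in for a real training set.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # No log_* calls here: autolog hooks fit() and fills in the run.
    with mlflow.start_run() as run:
        LogisticRegression(C=0.5, max_iter=200).fit(X, y)
    return run.info.run_id
```

The explicit start_run() is optional; it is used here only so the function can return the run ID.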

Where the data lives (and how it scales)

  • Local (this chapter): mlruns/ is a folder on disk — perfect for solo work.
  • Team: run an MLflow tracking server backed by a database (Postgres) and object storage (S3), and point everyone at it with mlflow.set_tracking_uri("http://mlflow.mycompany.com"). Now the whole team's experiments are in one searchable place.
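One way to move between the two setups without editing code is to choose the tracking URI at startup. The sketch below is an assumption about how you might wire that up: the helper names are hypothetical, and the ./mlruns fallback mirrors the folder used in this chapter. MLFLOW_TRACKING_URI is the environment variable MLflow itself reads, so the sketch only makes the precedence explicit.

```python
import os


def resolve_tracking_uri(explicit=None, env=None):
    """Pick a URI: explicit argument first, then MLFLOW_TRACKING_URI, then ./mlruns."""
    env = os.environ if env is None else env
    return explicit or env.get("MLFLOW_TRACKING_URI") or "./mlruns"


def configure_tracking(explicit=None):
    """Point MLflow at the resolved URI and return it (hypothetical helper)."""
    import mlflow  # deferred so the resolver works without MLflow installed

    uri = resolve_tracking_uri(explicit)
    mlflow.set_tracking_uri(uri)
    return uri
```

On a laptop with nothing set, runs land in the local folder; on a team machine, exporting MLFLOW_TRACKING_URI sends the same script's runs to the shared server.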

Don't be confused: tracking vs. the registry. Tracking (this chapter) records experiments — every run you ever did, good or bad. The registry (next chapter) manages the chosen models — versioned, named, promoted to production. Tracking is your lab notebook; the registry is your shipping shelf. They're complementary, and MLflow does both.

Why this matters in practice

  • Reproducibility — months later you can see exactly what produced a model.
  • Comparison — no more guessing whether the new idea actually helped.
  • Collaboration — teammates see your runs instead of asking "what did you try?"
  • Auditability — regulated industries often require a record of how each model was built. MLflow's run history can serve as that record.

This is the difference between "I think the new model is better" and "run #17 improved test accuracy from 0.92 to 0.96; here's the proof and the exact config."

The takeaway

MLflow Tracking records every run's params, metrics, and artifacts with three calls (log_params, log_metrics, log_artifact), gives you a UI to compare them, and scales from a local folder to a shared team server. It turns "which experiment was best?" from archaeology into a database query. Now that we can find our best model, let's manage and ship it with the MLflow Model Registry. 👉