How do you track ML experiments and manage model versions in a team setting? What breaks down when you don't have a system for this?
formulate your answer, then —
You mentioned logging the git commit hash for reproducibility — what else do you need to actually reproduce a training run, and what's usually missing in practice?
formulate your answer, then —
tldr
Experiment tracking logs parameters, metrics, artifacts, and code per run — enabling comparison and retrieval. A model registry manages the lifecycle from experiment to production, so deployment references a stage (e.g. "production") rather than a hard-coded version number. Full reproducibility requires pinning code (git hash), data (DVC or snapshots), environment (Docker), and random seeds. Data versioning is the piece most teams skip and most often regret.
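A minimal sketch of what one tracked run should record — hypothetical helper, not any specific library's API (MLflow, W&B, etc. all capture roughly this shape). It pins the git commit, the seed (via params), and a hash of the data file, which is exactly the piece that usually goes missing:

```python
import hashlib
import json
import random
import subprocess
from pathlib import Path

def log_run(run_dir, params, metrics, data_path=None):
    """Record everything needed to compare and reproduce a run."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "params": params,    # hyperparameters, including the seed
        "metrics": metrics,  # eval metrics, for cross-run comparison
        # pin the code: exact commit the run was launched from
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True,
        ).stdout.strip() or "unknown",
    }
    if data_path and Path(data_path).exists():
        # pin the data: content hash lets a later run verify it is
        # training on the same bytes, not just the same filename
        record["data_sha256"] = hashlib.sha256(
            Path(data_path).read_bytes()
        ).hexdigest()
    (run_dir / "run.json").write_text(json.dumps(record, indent=2))
    return record

# usage: the seed lives in params, so it is logged and replayable
params = {"lr": 3e-4, "seed": 42}
random.seed(params["seed"])
rec = log_run("runs/001", params, metrics={"val_acc": 0.91})
```

The environment (Docker image digest or a lockfile hash) would be one more field in `record`; omitted here to keep the sketch short.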
follow-up
- How would you design an experiment tracking system for a team that runs 500 experiments per day?
- What's the difference between a model registry and a model store, and when do you need both?
- How do you handle the case where two experiments produce identical offline metrics but different production performance?