mlprep / ML Breadth · hard · 12 min

Your model improves NDCG by 2% in offline evaluation but shows no movement in the A/B test. Walk me through your debugging checklist.

formulate your answer, then read on.

tldr

An offline-online gap has three root causes: (1) offline eval problems (stale data, biased labels, the wrong metric, an unrepresentative eval set); (2) experiment design (insufficient statistical power, a diluted treatment, implementation bugs); (3) metric misalignment (optimizing a proxy that doesn't drive the business metric). Debug in order: validate deployment → check statistical power → slice by affected population → verify metric correlation → extend runtime.
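
To make the "check statistical power" step concrete, here is a minimal sketch of a minimum detectable effect (MDE) calculation, assuming a two-arm test on a binary metric with a normal approximation. The baseline rate and per-arm traffic below are hypothetical placeholders.

```python
# Minimal sketch: minimum detectable effect (MDE) for a two-arm A/B test
# on a binary metric (e.g. click-through rate), normal approximation.
from scipy.stats import norm

def mde(baseline_rate: float, n_per_arm: int,
        alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest absolute lift detectable at the given significance and power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided test
    z_power = norm.ppf(power)
    se = (2 * baseline_rate * (1 - baseline_rate) / n_per_arm) ** 0.5
    return (z_alpha + z_power) * se

# Hypothetical numbers: 5% baseline CTR, 50k users per arm.
lift = mde(baseline_rate=0.05, n_per_arm=50_000)
print(f"MDE: {lift:.4f} absolute ({lift / 0.05:.1%} relative)")
```

With these placeholder numbers the MDE works out to roughly a 7.7% relative lift; if a 2% NDCG gain plausibly moves the online metric by less than that, "no movement" is exactly what an underpowered test would show even when the model is genuinely better.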

follow-up

  • How do you compute the minimum detectable effect for an A/B test, and how do you decide whether your experiment has enough power?
  • What are surrogate metrics, and how do you validate that a surrogate reliably predicts the business metric before relying on it for launch decisions? (A sketch follows this list.)
  • You find that the NDCG improvement comes entirely from power users, but the A/B test's primary metric is averaged across all users. How do you surface this to stakeholders?
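
On the "verify metric correlation" step (and the surrogate-metric question above), a minimal sketch of one common validation: rank-correlate offline metric deltas with online business-metric deltas across past experiments. The `history` values are hypothetical placeholders, one row per previous A/B test.

```python
# Minimal sketch: validating offline NDCG delta as a surrogate for the
# online business metric, using results from historical experiments.
import numpy as np
from scipy.stats import spearmanr

history = np.array([
    # (offline NDCG delta, online business-metric delta): placeholder values
    (0.020,  0.004),
    (0.015,  0.002),
    (0.030,  0.006),
    (0.010, -0.001),
    (0.025,  0.005),
    (0.005,  0.000),
])

rho, p_value = spearmanr(history[:, 0], history[:, 1])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A high, stable rank correlation across many past experiments is what justifies treating the offline metric as a launch surrogate; if it is weak, a 2% offline gain carries little predictive weight, and the observed offline-online gap is the expected outcome rather than a bug.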