You have logs from the current recommender. How can you estimate whether a new policy would perform better without fully launching it?
tldr
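Counterfactual (off-policy) evaluation estimates a new policy's performance from logs collected under the old policy. Inverse propensity scoring (IPS) reweights each logged outcome by the ratio of the new policy's action probability to the logged propensity, which corrects for the old policy's selection bias. Its main limitations are high variance when those weights get large and support mismatch, where the new policy chooses actions the old policy rarely or never logged. Exploration traffic and doubly robust estimators improve reliability, but an A/B test is still needed for the launch decision.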
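The IPS reweighting can be sketched in a few lines. This is a minimal illustration, not a production estimator: the function name `ips_estimate`, its parameters, and the optional `clip` cap on weights are assumptions made for the example, and it assumes each log record carries the reward, the logged propensity of the action shown, and the new policy's probability for that same action.

```python
from typing import Optional, Sequence


def ips_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_probs: Sequence[float],
    clip: Optional[float] = None,
) -> float:
    """Estimate the new policy's average reward from old-policy logs.

    Each record i holds the observed reward, the probability the logging
    policy gave to the logged action, and the probability the new policy
    gives to that same action. `clip` (hypothetical knob) caps the weights
    to trade some bias for lower variance.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one logged record")
    if not (n == len(logging_propensities) == len(target_probs)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for r, p_log, p_new in zip(rewards, logging_propensities, target_probs):
        if p_log <= 0.0:
            # A zero propensity means the logs cannot speak for this action.
            raise ValueError("logged propensity must be positive")
        weight = p_new / p_log
        if clip is not None:
            weight = min(weight, clip)
        total += weight * r
    return total / n


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]
    logged = [0.5, 0.5, 0.25, 0.25]
    new = [0.5, 0.5, 0.5, 0.5]
    print(ips_estimate(rewards, logged, new))
    print(ips_estimate(rewards, logged, new, clip=1.5))
```

Small logged propensities produce large weights, which is where the variance problem comes from; clipping is one common, biased way to tame it.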
follow-up
- Why do you need logged propensities for IPS?
- What is the support problem in off-policy evaluation?
- When would doubly robust estimation beat IPS?
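As a starting point for the last follow-up, here is a hedged sketch of a doubly robust estimate. It assumes a reward model is already available; `q_logged` (the model's prediction for the logged action) and `q_target` (the model's expected reward under the new policy for that context) are hypothetical inputs named for this example, as is the function itself.

```python
from typing import Sequence


def doubly_robust_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_probs: Sequence[float],
    q_logged: Sequence[float],
    q_target: Sequence[float],
) -> float:
    """Doubly robust estimate of the new policy's average reward.

    Per record: start from the reward model's prediction under the new
    policy (q_target), then add an importance-weighted correction for the
    model's error on the action that was actually logged.
    """
    n = len(rewards)
    if n == 0:
        raise ValueError("need at least one logged record")
    inputs = (logging_propensities, target_probs, q_logged, q_target)
    if any(len(seq) != n for seq in inputs):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for r, p_log, p_new, q_a, q_pi in zip(
        rewards, logging_propensities, target_probs, q_logged, q_target
    ):
        if p_log <= 0.0:
            raise ValueError("logged propensity must be positive")
        weight = p_new / p_log
        # Model baseline plus weighted residual on the logged action.
        total += q_pi + weight * (r - q_a)
    return total / n
```

If the reward model predicts zero everywhere, this reduces to plain IPS; if the model is accurate, the residuals are small, so large weights multiply small numbers and the variance drops.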