Counterfactual Risk Minimization (CRM) is a framework for dealing with the logged bandit feedback problem, where the goal is to improve a logging policy using offline data. In this paper, we explore the case where learned policies can be deployed multiple times to acquire new data. We extend the CRM principle and its theory to this scenario, which we call "Sequential Counterfactual Risk Minimization (SCRM)." We introduce a novel counterfactual estimator and identify conditions under which sequential deployments improve the excess risk and regret rates of CRM, via an analysis inspired by restart strategies in accelerated optimization methods. We also evaluate our method empirically in both discrete and continuous action settings, and demonstrate the benefits of multiple deployments of CRM.
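As background, the single-deployment CRM principle (Swaminathan and Joachims, 2015) that this work extends selects a policy by minimizing a clipped importance-weighted risk estimate plus an empirical-variance penalty. A minimal sketch follows; the notation $(x_i, a_i, y_i)$ for logged contexts, actions, and losses, the clipping level $M$, and the regularization weight $\lambda$ are assumptions of this sketch rather than notation from the abstract:

$$
\hat{\pi} \in \operatorname*{arg\,min}_{\pi \in \Pi} \left\{ \hat{R}_n(\pi) + \lambda \sqrt{\frac{\hat{V}_n(\pi)}{n}} \right\},
\qquad
\hat{R}_n(\pi) = \frac{1}{n} \sum_{i=1}^{n} y_i \, \min\!\left(M, \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\right),
$$

where the samples are logged under the logging policy $\pi_0$ and $\hat{V}_n(\pi)$ is the empirical variance of the clipped importance-weighted losses. SCRM applies such an estimator repeatedly, with each deployed policy serving as the logging policy for the next round of data collection.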