In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated by a different behavior policy, without executing the target policy. Reinforcement learning in high-stakes environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or the infeasibility of exploration. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deploying the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates from one or multiple logged trajectories. Leveraging methodologies from distributionally robust optimization, we show that with a proper choice of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees in stochastic or adversarial environments. Our results also generalize to batch reinforcement learning and are supported by empirical analysis.
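To make the construction concrete, here is a minimal sketch of the kind of robust and optimistic estimates described above; the specific uncertainty-set geometry (a KL-divergence ball), the notation, and the choice of radius are illustrative assumptions rather than the paper's exact formulation:
\[
\widehat V^{\mathrm{rob}}
= \inf_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, \widehat P_n) \le \delta_n}
  \mathbb{E}_{Q}\!\Bigl[\textstyle\sum_{t=0}^{H-1} \gamma^{t}\, w_{1:t}\, r_t\Bigr],
\qquad
\widehat V^{\mathrm{opt}}
= \sup_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, \widehat P_n) \le \delta_n}
  \mathbb{E}_{Q}\!\Bigl[\textstyle\sum_{t=0}^{H-1} \gamma^{t}\, w_{1:t}\, r_t\Bigr],
\]
where $\widehat P_n$ denotes the empirical distribution over the $n$ logged trajectories, $w_{1:t}$ the cumulative importance ratio between the target and behavior policies up to step $t$, $r_t$ the reward, and $\gamma$ the discount factor. Choosing the radius $\delta_n$ to shrink at an appropriate rate with $n$ is what allows the interval $[\widehat V^{\mathrm{rob}}, \widehat V^{\mathrm{opt}}]$ to act as a confidence bound on the target policy's value.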