We are concerned with the problem of hyperparameter selection for offline policy evaluation (OPE). OPE is a key component of offline reinforcement learning, which is a core technology for data-driven decision optimization without environment simulators. However, the current state-of-the-art OPE methods are not hyperparameter-free, which undermines their utility in real-life applications. We address this issue by introducing a new approximate hyperparameter selection (AHS) framework for OPE, which defines a notion of optimality (called selection criteria) in a quantitative and interpretable manner without hyperparameters. We then derive four AHS methods, each of which has different characteristics such as convergence rate and time complexity. Finally, we verify the effectiveness and limitations of these methods in a preliminary experiment.