Off-policy evaluation (OPE) is the task of estimating the expected reward of a target policy from offline data previously collected under different policies. OPE is therefore a key step in applying reinforcement learning to real-world domains such as medical treatment, where interactive data collection is expensive or even unsafe. Because the observed data tend to be noisy and limited, it is essential to provide rigorous uncertainty quantification, not just a point estimate, when applying OPE to make high-stakes decisions. This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation, which remains a challenging open question. We develop a practical algorithm through a primal-dual optimization-based approach, which leverages the kernel Bellman loss (KBL) of Feng et al. (2019) and a new martingale concentration inequality for the KBL that is applicable to time-dependent data with unknown mixing conditions. Our algorithm makes minimal assumptions on the data and the function class of the Q-function, and works in the behavior-agnostic setting where the data is collected under a mix of arbitrary, unknown behavior policies. We present empirical results that clearly demonstrate the advantages of our approach over existing methods.
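To make the kernel Bellman loss concrete, the sketch below shows a minimal empirical estimate of it in the style of Feng et al. (2019). This is an illustrative assumption-laden sketch, not the paper's implementation: the RBF kernel, the V-statistic form, and all function signatures (`q`, `pi`, `bandwidth`) are choices made here for illustration.

```python
import numpy as np

def kernel_bellman_loss(q, transitions, pi, gamma=0.99, bandwidth=1.0):
    """Empirical kernel Bellman loss (illustrative sketch, not the
    authors' implementation).

    q           -- candidate Q-function, callable q(s, a) -> float
    transitions -- iterable of (s, a, r, s_next) tuples (arrays/floats)
    pi          -- target policy, callable pi(s) -> action
    """
    # Bellman residuals eps_i = r_i + gamma * q(s'_i, a'_i) - q(s_i, a_i),
    # with a'_i drawn from the target policy pi at the next state
    eps = np.array([r + gamma * q(s2, pi(s2)) - q(s, a)
                    for (s, a, r, s2) in transitions])
    # Kernel evaluation points x_i = (s_i, a_i)
    xs = np.array([np.concatenate([np.atleast_1d(s), np.atleast_1d(a)])
                   for (s, a, r, s2) in transitions])
    # RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))
    d2 = ((xs[:, None, :] - xs[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    n = len(eps)
    # V-statistic form (1/n^2) * eps^T K eps: nonnegative, and zero when
    # the Bellman residual vanishes on the observed data
    return float(eps @ K @ eps) / (n ** 2)
```

The loss is zero for a Q-function whose Bellman residual vanishes on the data, and positive otherwise, which is what makes it usable as the objective inside a primal-dual confidence-interval construction.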