In reinforcement learning, distributional off-policy evaluation (OPE) is concerned with estimating the return distribution of a target policy from offline data collected under a different policy. This work extends the widely used fitted Q-evaluation -- developed for expectation-based reinforcement learning -- to the distributional OPE setting; we refer to this extension as fitted distributional evaluation (FDE). Although a few related approaches exist, there is still no unified framework for designing FDE methods. To fill this gap, we present a set of guiding principles for constructing theoretically grounded FDE methods. Building on these principles, we develop several new FDE methods with convergence analysis and provide theoretical justification for existing methods, even in non-tabular environments. Extensive experiments, including simulations on linear quadratic regulators and Atari games, demonstrate the superior performance of the proposed FDE methods.
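As a rough, hypothetical sketch of the fitted-evaluation idea underlying FDE (not the paper's specific constructions), the snippet below iterates a distributional Bellman backup over offline transitions in a tabular setting, assuming a quantile parameterization of the return distribution. The function name `fde_quantile_tabular` and all parameters are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of a fitted distributional evaluation (FDE) iteration in a
# tabular setting with a quantile parameterization of the return distribution.
# Illustrates the general "fitted" idea (repeated distributional Bellman
# backups on offline data), not the specific FDE methods developed in the paper.
import numpy as np

def fde_quantile_tabular(transitions, target_policy, n_states, n_actions,
                         n_quantiles=32, gamma=0.99, n_iters=50):
    """transitions: list of (s, a, r, s_next) collected under a behavior policy.
    target_policy: integer array of shape (n_states,), the target action per state.
    Returns Z of shape (n_states, n_actions, n_quantiles), whose rows approximate
    the quantiles of the target policy's return distribution."""
    Z = np.zeros((n_states, n_actions, n_quantiles))
    taus = (np.arange(n_quantiles) + 0.5) / n_quantiles  # quantile midpoints
    for _ in range(n_iters):
        # Collect distributional Bellman target samples per (s, a) pair.
        targets = [[[] for _ in range(n_actions)] for _ in range(n_states)]
        for s, a, r, s_next in transitions:
            a_next = target_policy[s_next]
            # Target distribution samples: r + gamma * Z(s', pi(s')).
            targets[s][a].append(r + gamma * Z[s_next, a_next])
        Z_new = Z.copy()
        for s in range(n_states):
            for a in range(n_actions):
                if targets[s][a]:
                    samples = np.concatenate(targets[s][a])
                    # Fitted step: project the pooled target samples back onto
                    # the quantile representation (empirical tau-quantiles).
                    Z_new[s, a] = np.quantile(samples, taus)
        Z = Z_new
    return Z
```

In practice the tabular projection step would be replaced by minimizing a distributional loss over a function class (as in fitted Q-evaluation with function approximation); the choice of that loss is where the guiding principles of the paper come in.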