How to select between policies and value functions produced by different training algorithms in offline reinforcement learning (RL) -- which is crucial for hyperparameter tuning -- is an important open question. Existing approaches based on off-policy evaluation (OPE) often require additional function approximation and hence hyperparameters, creating a chicken-and-egg situation. In this paper, we design hyperparameter-free algorithms for policy selection based on BVFT [XJ21], a recent theoretical advance in value-function selection, and demonstrate their effectiveness in discrete-action benchmarks such as Atari. To address performance degradation due to poor critics in continuous-action domains, we further combine BVFT with OPE to get the best of both worlds, and obtain a hyperparameter-tuning method for Q-function based OPE with theoretical guarantees as a side product.