Overestimation bias control techniques are used by the majority of high-performing off-policy reinforcement learning algorithms. However, most of these techniques rely on pre-defined bias correction policies that either are not flexible enough or require environment-specific hyperparameter tuning. In this work, we present a general data-driven approach for the automatic selection of bias control hyperparameters. We demonstrate its effectiveness on three algorithms: Truncated Quantile Critics, Weighted Delayed DDPG, and Maxmin Q-learning. The proposed technique eliminates the need for an extensive hyperparameter search, and we show that it significantly reduces the number of environment interactions while preserving performance.
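To illustrate the general idea of data-driven bias control hyperparameter selection, the sketch below adjusts a single bias-control knob (e.g., the number of dropped target quantiles in TQC) based on a measured overestimation signal. This is a minimal, hypothetical instantiation, not the paper's exact procedure; the functions `estimate_bias` and `adjust_bias_control`, the Monte-Carlo bias estimator, and the bounds on the hyperparameter are all illustrative assumptions.

```python
import numpy as np

def estimate_bias(critic, states, actions, rewards, gamma=0.99):
    """Estimate overestimation bias as the gap between the critic's
    prediction at the start of a rollout and the observed discounted
    return (an illustrative signal, not the paper's exact estimator)."""
    mc_return = sum(gamma ** t * r for t, r in enumerate(rewards))
    q_pred = critic(states[0], actions[0])
    return q_pred - mc_return  # > 0 indicates overestimation

def adjust_bias_control(param, bias, step=1, lo=0, hi=25):
    """Data-driven update of a bias-control hyperparameter, e.g. the
    number of dropped quantiles in TQC: apply stronger correction when
    the critic overestimates, weaker when it underestimates."""
    if bias > 0:
        return min(hi, param + step)
    if bias < 0:
        return max(lo, param - step)
    return param
```

In a training loop, `estimate_bias` would be evaluated periodically on recent rollouts and `adjust_bias_control` applied to the current hyperparameter, replacing a fixed, manually tuned value.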