Bias correction techniques are used by most high-performing methods for off-policy reinforcement learning. However, these techniques rely on a pre-defined bias correction policy that is either insufficiently flexible or requires environment-specific hyperparameter tuning. In this work, we present a simple data-driven approach for guiding bias correction. We demonstrate its effectiveness on Truncated Quantile Critics -- a state-of-the-art continuous control algorithm. The proposed technique adjusts the amount of bias correction across environments automatically. As a result, it eliminates the need for an extensive hyperparameter search, significantly reducing the required number of environment interactions and computation.
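To make the mechanism being tuned concrete, the following is a minimal sketch of the quantile-truncation idea behind Truncated Quantile Critics: each critic predicts a set of return quantiles, the predictions are pooled and sorted, and the largest ones are discarded before averaging to form the target, which counteracts overestimation bias. The function name and the per-critic drop count `drop_per_critic` are our illustrative choices, not the paper's API; the paper's contribution is selecting the amount of truncation in a data-driven way rather than fixing it per environment.

```python
def truncated_target(quantiles_per_critic, drop_per_critic):
    """Sketch of TQC-style bias correction.

    quantiles_per_critic: list of per-critic quantile estimates of the return.
    drop_per_critic: how many of the largest pooled quantiles to discard
        per critic; larger values correct overestimation more aggressively.
    """
    # Pool quantile estimates from all critics and sort ascending.
    pooled = sorted(q for critic in quantiles_per_critic for q in critic)
    # Discard the top quantiles, then average the rest to form the target.
    n_drop = drop_per_critic * len(quantiles_per_critic)
    kept = pooled[: len(pooled) - n_drop]
    return sum(kept) / len(kept)
```

For two critics each predicting quantiles `[1, 2, 3, 4]`, the untruncated mean is 2.5; dropping one quantile per critic removes the two largest pooled values and yields 2.0, a lower (more pessimistic) target.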