Recently, Truncated Quantile Critics (TQC), which uses a distributional representation of critics, was shown to achieve state-of-the-art asymptotic training performance on all environments of the MuJoCo continuous control benchmark suite. Also recently, Randomized Ensemble Double Q-Learning (REDQ), which uses a high update-to-data ratio and target randomization, was shown to achieve high sample efficiency that is competitive with state-of-the-art model-based methods. In this paper, we propose a novel model-free algorithm, Aggressive Q-Learning with Ensembles (AQE), which improves on the sample efficiency of REDQ and the asymptotic performance of TQC, thereby providing overall state-of-the-art performance during all stages of training. Moreover, AQE is very simple, requiring neither a distributional representation of critics nor target randomization.
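To make "target randomization" concrete, here is a minimal Python sketch of a REDQ-style bootstrap target, where the minimum is taken over a randomly sampled subset of the ensemble's next-state Q-estimates. It is shown alongside a deterministic ensemble target that draws no random subset. The function names, the choice to average the `k` smallest estimates in the deterministic variant, and the omission of the entropy bonus used by SAC-based methods are all illustrative assumptions. This is not AQE's exact update rule.

```python
import random
from typing import List


def redq_style_target(next_q: List[float], subset_size: int, reward: float,
                      gamma: float, done: bool, rng: random.Random) -> float:
    """REDQ-style target randomization: bootstrap from the minimum over a
    randomly sampled subset of the ensemble's next-state Q-estimates.
    (Entropy bonus omitted for simplicity; an assumption of this sketch.)"""
    if not 1 <= subset_size <= len(next_q):
        raise ValueError("subset_size must be between 1 and the ensemble size")
    subset = rng.sample(next_q, subset_size)  # fresh random subset per call
    bootstrap = min(subset)
    return reward + gamma * (0.0 if done else bootstrap)


def deterministic_ensemble_target(next_q: List[float], k: int, reward: float,
                                  gamma: float, done: bool) -> float:
    """Hypothetical deterministic alternative (illustrative assumption):
    average the k smallest ensemble estimates, with no random subset."""
    if not 1 <= k <= len(next_q):
        raise ValueError("k must be between 1 and the ensemble size")
    lowest = sorted(next_q)[:k]
    bootstrap = sum(lowest) / k
    return reward + gamma * (0.0 if done else bootstrap)


if __name__ == "__main__":
    # Next-state Q-estimates from a hypothetical 5-member ensemble.
    next_q = [10.0, 12.0, 9.0, 11.0, 13.0]
    rng = random.Random(0)
    print(redq_style_target(next_q, 2, reward=1.0, gamma=0.99, done=False, rng=rng))
    print(deterministic_ensemble_target(next_q, 2, reward=1.0, gamma=0.99, done=False))
```

Because `rng.sample` draws a new subset on each call, the REDQ-style target can differ across updates even for identical inputs. The deterministic variant always returns the same value for the same ensemble estimates.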