Model-free off-policy actor-critic methods are an efficient solution to complex continuous control tasks. However, these algorithms rely on a number of design tricks and hyperparameters, making their application to new domains difficult and computationally expensive. This paper presents an evolutionary approach that automatically tunes these design decisions and eliminates the RL-specific hyperparameters from the Soft Actor-Critic algorithm. Our design is sample-efficient and provides practical advantages over baseline approaches, including improved exploration, generalization over multiple control frequencies, and a robust ensemble of high-performance policies. Empirically, we show that our agent outperforms well-tuned hyperparameter settings on popular benchmarks from the DeepMind Control Suite. We then apply it to less common control tasks outside of simulated robotics to find high-performance solutions with minimal compute and research effort.