Reinforcement learning (RL) offers the potential for training generally capable agents that can interact autonomously in the real world. However, one key limitation is the brittleness of RL algorithms with respect to core hyperparameters and the choice of network architecture. Furthermore, non-stationarities such as evolving training data and increasing agent complexity mean that different hyperparameters and architectures may be optimal at different points of training. This motivates AutoRL, a class of methods that seek to automate these design choices. One prominent class of AutoRL methods is Population-Based Training (PBT), which has led to impressive performance in several large-scale settings. In this paper, we introduce two new innovations in PBT-style methods. First, we employ trust-region-based Bayesian optimization, enabling full coverage of the high-dimensional mixed hyperparameter search space. Second, we show that, using a generational approach, we can also learn both architectures and hyperparameters jointly on the fly in a single training run. Leveraging the new, highly parallelizable Brax physics engine, we show that these innovations lead to large performance gains, significantly outperforming the tuned baseline while learning entire configurations on the fly. Code is available at https://github.com/xingchenwan/bgpbt.
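To make the PBT-style setup concrete, the following is a minimal sketch of the exploit/explore loop that underlies Population-Based Training. All names and the toy objective are illustrative assumptions, not the paper's actual implementation: the method described above replaces the random perturbation step with trust-region Bayesian optimization and adds generational architecture search.

```python
import random

def toy_score(lr):
    # Stand-in for an RL agent's return after one training interval;
    # peaks at lr = 0.01. A real PBT run would train the agent here.
    return -abs(lr - 0.01)

def pbt(pop_size=8, generations=20, seed=0):
    rng = random.Random(seed)
    # Each population member carries its own hyperparameter(s) and score.
    pop = [{"lr": 10 ** rng.uniform(-4, -1), "score": float("-inf")}
           for _ in range(pop_size)]
    for _ in range(generations):
        for m in pop:                          # "train" and evaluate each member
            m["score"] = toy_score(m["lr"])
        pop.sort(key=lambda m: m["score"], reverse=True)
        top = pop[:pop_size // 4]
        for loser in pop[-(pop_size // 4):]:
            winner = rng.choice(top)
            loser["lr"] = winner["lr"]         # exploit: copy a top performer
            loser["lr"] *= rng.choice([0.8, 1.25])  # explore: perturb the copy
    best = max(pop, key=lambda m: m["score"])
    return best["lr"], best["score"]

best_lr, best_score = pbt()
print(best_lr, best_score)
```

In this sketch the bottom quarter of the population periodically inherits weights and hyperparameters from the top quarter, so the schedule of hyperparameters is learned during the single training run rather than fixed in advance.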