Dynamic platforms that operate over many unique terrain conditions typically require many behaviours. To transition safely between behaviours, there must be an overlap of states between adjacent controllers. We develop a novel method for training setup policies that bridge the trajectories between pre-trained Deep Reinforcement Learning (DRL) policies. We demonstrate our method on a simulated biped traversing a difficult jump terrain, where a single policy fails to learn the task and switching between pre-trained policies without setup policies also fails. We perform an ablation of the key components of our system and show that our method outperforms others that learn transition policies. We further evaluate on several difficult and diverse terrain types, and show that setup policies can be used as part of a modular control suite to successfully traverse a sequence of complex terrains. Using setup policies improves the success rate of traversing a single difficult jump terrain (from 51.3% with the best comparative method to 82.2%) and of traversing a random sequence of difficult obstacles (from 1.9% without setup policies to 71.2%).
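To make the switching architecture concrete, the sketch below shows one plausible way a setup policy could be slotted between two pre-trained behaviour policies at run time: the first policy runs until the platform nears the terrain boundary, the setup policy then drives the state into the region where the next policy was trained to start, and control is handed over. This is a minimal illustrative sketch, not the authors' implementation; all names (`policy_a`, `setup_policy`, `policy_b`, `env`, `at_terrain_boundary`, `in_start_region`) are hypothetical placeholders.

```python
# Hypothetical sketch of switching between pre-trained policies via a setup
# policy. The environment and predicate functions are assumed interfaces,
# not part of the published method.

def traverse_with_setup(env, policy_a, setup_policy, policy_b,
                        at_terrain_boundary, in_start_region,
                        max_steps=1000):
    """Run policy A, hand control to the setup policy near the terrain
    boundary, then switch to policy B once the state lies within B's
    expected start-state distribution (the 'overlap of states')."""
    obs = env.reset()
    active = policy_a
    for _ in range(max_steps):
        if active is policy_a and at_terrain_boundary(obs):
            active = setup_policy          # bridge the transition
        elif active is setup_policy and in_start_region(obs):
            active = policy_b              # states now overlap with B's start set
        action = active(obs)
        obs, reward, done, info = env.step(action)
        if done:
            break
    return info
```

Under this sketch, a modular control suite for a sequence of terrains would simply chain such segments, inserting one setup policy per pair of adjacent behaviour controllers.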