Legged robots often use separate control policies that are highly engineered for traversing difficult terrain such as stairs, gaps, and steps, where switching between policies is only possible when the robot is in a region that is common to adjacent controllers. Deep Reinforcement Learning (DRL) is a promising alternative to hand-crafted control design, though it typically requires the full set of test conditions to be known before training. DRL policies can result in complex (often unrealistic) behaviours that have few or no overlapping regions between adjacent policies, making it difficult to switch behaviours. In this work we develop multiple DRL policies with Curriculum Learning (CL), each of which can traverse a single respective terrain condition, while ensuring an overlap between policies. We then train a network for each destination policy that estimates the likelihood of successfully switching to it from any other policy. We evaluate our switching method on a previously unseen combination of terrain artifacts and show that it performs better than heuristic methods. While our method is trained on individual terrain types, it performs comparably to a Deep Q Network trained on the full set of terrain conditions. This approach allows the development of separate policies in constrained conditions with embedded prior knowledge about each behaviour, is scalable to any number of behaviours, and prepares DRL methods for applications in the real world.
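To make the switching idea concrete, the sketch below shows one plausible shape for a per-destination switch estimator: a small network that maps the robot's current state to a success probability, with the destination policy chosen as the one whose estimator scores highest. This is an illustrative assumption only; the class names, network sizes, and state dimension here are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class SwitchEstimator(nn.Module):
    """Hypothetical per-destination-policy network: maps the robot's state
    to the probability that switching to this destination policy succeeds.
    Architecture and sizes are assumptions for illustration."""

    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Output in [0, 1]: estimated likelihood of a successful switch.
        return self.net(state)


def select_policy(state: torch.Tensor, estimators: dict) -> str:
    """Pick the destination behaviour whose estimator predicts the highest
    probability of a successful switch from the current state."""
    with torch.no_grad():
        scores = {name: est(state).item() for name, est in estimators.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    state_dim = 32  # assumed observation size
    estimators = {t: SwitchEstimator(state_dim) for t in ("stairs", "gaps", "steps")}
    current_state = torch.randn(state_dim)
    print("switch to:", select_policy(current_state, estimators))
```

In this reading, each estimator would be trained separately (e.g. on labelled switch attempts for its own destination behaviour), so adding a new behaviour only requires training one additional estimator rather than retraining a monolithic policy.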