The automatic synthesis of policies for robotic-control tasks through reinforcement learning relies on a reward signal that simultaneously captures many, possibly conflicting, requirements. In this paper, we introduce a novel hierarchical potential-based reward-shaping approach (HPRS) for defining effective, multivariate rewards for a large family of such control tasks. We formalize a task as a partially ordered set of safety, target, and comfort requirements, and define an automated methodology that enforces a natural order among requirements and shapes the associated reward. Building upon potential-based reward shaping, we show that HPRS preserves policy optimality. Our experimental evaluation demonstrates HPRS's superior ability to capture the intended behavior, resulting in task-satisfying policies with improved comfort and faster convergence to optimal behavior than other state-of-the-art approaches. We demonstrate the practical usability of HPRS on several robotics applications and its smooth sim2real transition in two autonomous-driving scenarios with F1TENTH race cars.
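For context, the following is a minimal sketch of the standard potential-based shaping mechanism that the optimality-preservation claim rests on; the symbols are generic and not the paper's exact notation, and the hierarchical construction of the potential from the requirements is given in the paper body. In the standard formulation, for any potential function $\Phi$ over states and discount factor $\gamma$, the shaped reward is
\[
  r'(s, a, s') \;=\; r(s, a, s') \;+\; \gamma\,\Phi(s') \;-\; \Phi(s),
\]
a transformation known to leave the set of optimal policies of the underlying MDP unchanged. HPRS builds on this guarantee, with $\Phi$ derived from the partially ordered safety, target, and comfort requirements.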