Reward design is a fundamental problem in reinforcement learning (RL). A misspecified or poorly designed reward can result in low sample efficiency and undesired behaviors. In this paper, we propose the idea of \textit{programmatic reward design}, i.e., using programs to specify the reward functions in RL environments. Programs allow human engineers to express sub-goals and complex task scenarios in a structured and interpretable way. The challenge of programmatic reward design, however, is that while humans can provide the high-level structures, properly setting the low-level details, such as the right amount of reward for a specific sub-task, remains difficult. A major contribution of this paper is a probabilistic framework that can infer the best candidate programmatic reward function from expert demonstrations. Inspired by recent generative-adversarial approaches, our framework searches for the most likely programmatic reward function under which the optimally generated trajectories cannot be differentiated from the demonstrated trajectories. Experimental results show that programmatic reward functions learned using this framework can significantly outperform those learned using existing reward learning algorithms, and enable RL agents to achieve state-of-the-art performance on highly complex tasks.
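As a minimal illustration of the idea (not taken from the paper), a programmatic reward function might encode sub-goals as an interpretable program written by an engineer, while the numeric reward values are left as low-level parameters to be inferred from demonstrations. The sketch below uses hypothetical state predicates and parameter names purely for exposition.

\begin{verbatim}
# Hypothetical sketch of a programmatic reward function. The program
# structure (sub-goal checks) is specified by a human engineer; the
# numeric parameters in `theta` are the low-level details that a
# framework like the one described here would infer from expert
# demonstrations. All names are illustrative assumptions.

def programmatic_reward(state, theta):
    """Return the reward for `state` under reward parameters `theta`."""
    reward = theta["step_penalty"]        # small cost for every step taken
    if state["has_key"]:                  # sub-goal 1: picked up the key
        reward += theta["key_bonus"]
    if state["door_open"]:                # sub-goal 2: opened the door
        reward += theta["door_bonus"]
    if state["at_goal"]:                  # final goal reached
        reward += theta["goal_bonus"]
    return reward

# One candidate parameter setting; the inference problem is to choose
# such values so that trajectories optimal under the reward cannot be
# distinguished from the demonstrated trajectories.
theta = {"step_penalty": -0.01, "key_bonus": 0.2,
         "door_bonus": 0.3, "goal_bonus": 1.0}
\end{verbatim}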