Reward design is a fundamental problem in reinforcement learning (RL). A misspecified or poorly designed reward can result in low sample efficiency and undesired behaviors. In this paper, we propose the idea of programmatic reward design, i.e., using programs to specify the reward functions in RL environments. Programs allow human engineers to express sub-goals and complex task scenarios in a structured and interpretable way. The challenge of programmatic reward design, however, is that while humans can provide the high-level structures, properly setting the low-level details, such as the right amount of reward for a specific sub-task, remains difficult. A major contribution of this paper is a probabilistic framework that can infer the best candidate programmatic reward function from expert demonstrations. Inspired by recent generative-adversarial approaches, our framework searches for the most likely programmatic reward function under which the optimally generated trajectories cannot be differentiated from the demonstrated trajectories. Experimental results show that programmatic reward functions learned using this framework can significantly outperform those learned using existing reward learning algorithms, and enable RL agents to achieve state-of-the-art performance on highly complex tasks.
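To make the notion of a programmatic reward function concrete, the sketch below is a hypothetical illustration (not code from the paper): a reward program whose sub-goal structure is written by hand, while the scalar reward magnitudes are left as free parameters to be inferred from expert demonstrations. The state fields and parameter names are assumptions chosen for illustration only.

```python
# Illustrative sketch only: a programmatic reward function with a
# human-specified high-level structure and unknown low-level magnitudes.
# The state attributes used here (just_picked_up_key, just_opened_door)
# are hypothetical and stand in for sub-goal predicates an engineer
# could write for a door-and-key style task.

def programmatic_reward(state, action, theta):
    """Reward with hand-written sub-goal structure and tunable coefficients.

    theta = (r_key, r_door, r_step) are the low-level details that a
    framework like the one described above would infer from demonstrations.
    """
    r_key, r_door, r_step = theta
    if state.just_picked_up_key:   # sub-goal 1: obtain the key
        return r_key
    if state.just_opened_door:     # sub-goal 2: open the door
        return r_door
    return r_step                  # per-step shaping term (often negative)
```

Under this view, the engineer fixes the branching structure of the program, and inference over theta selects the reward values under which trajectories generated by an optimal policy are hardest to distinguish from the demonstrated ones.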