A misspecified reward can degrade sample efficiency and induce undesired behaviors in reinforcement learning (RL) problems. We propose symbolic reward machines for incorporating high-level task knowledge when specifying the reward signals. Symbolic reward machines augment the existing reward machine formalism by allowing transitions to carry predicates and symbolic reward outputs. This formalism lends itself well to inverse reinforcement learning, where the key challenge is determining appropriate assignments to the symbolic values from a few expert demonstrations. We propose a hierarchical Bayesian approach for inferring the most likely assignments, such that the concretized reward machine can discriminate expert-demonstrated trajectories from other trajectories with high accuracy. Experimental results show that learned reward machines can significantly improve training efficiency for complex RL tasks and generalize well across different task environment configurations.
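To make the formalism concrete, the following is a minimal sketch of what a symbolic reward machine could look like as a data structure: a finite-state machine whose transitions are guarded by predicates over observations and emit symbolic rewards, which a separate assignment later concretizes into numeric values. This is an illustrative assumption, not the paper's implementation; the class name, the predicates (`has_key`, `door_open`), and the reward symbols (`r_key`, `r_door`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A guard maps an observation (here a dict) to True/False.
Predicate = Callable[[dict], bool]
# A transition is (source state, guard predicate, destination state, reward symbol).
Transition = Tuple[str, Predicate, str, str]

class SymbolicRewardMachine:
    """Finite-state machine whose transitions emit symbolic rewards."""

    def __init__(self, initial_state: str, transitions: List[Transition]):
        self.initial_state = initial_state
        self.transitions = transitions
        self.state = initial_state

    def reset(self) -> None:
        self.state = self.initial_state

    def step(self, obs: dict, assignment: Dict[str, float]) -> float:
        """Advance on one observation; return the reward under the given assignment."""
        for src, guard, dst, symbol in self.transitions:
            if src == self.state and guard(obs):
                self.state = dst
                return assignment[symbol]  # concretize the symbolic reward
        return 0.0  # no transition fired

# Hypothetical two-stage task: pick up a key, then open a door.
machine = SymbolicRewardMachine(
    initial_state="u0",
    transitions=[
        ("u0", lambda obs: obs.get("has_key", False), "u1", "r_key"),
        ("u1", lambda obs: obs.get("door_open", False), "u2", "r_door"),
    ],
)

# One candidate assignment of the symbolic values; the inference procedure
# described in the abstract would search over such assignments.
assignment = {"r_key": 0.1, "r_door": 1.0}
machine.reset()
print(machine.step({"has_key": True}, assignment))    # 0.1
print(machine.step({"door_open": True}, assignment))  # 1.0
```

Under this reading, the machine structure (states, guards, symbols) encodes the high-level task knowledge, while the numeric assignment is the part inferred from expert demonstrations.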