Our work aims to make efficient use of ambiguous demonstrations when training a reinforcement learning (RL) agent. An ambiguous demonstration can usually be interpreted in multiple ways, which severely hinders the RL agent from learning stably and efficiently. Because even an optimal demonstration may be ambiguous, previous works that combine RL with learning from demonstration (RLfD) may not perform well. Inspired by how humans handle such situations, we propose to use self-explanation, in which an agent generates explanations for itself, to recognize valuable high-level relational features as an interpretation of why a successful trajectory is successful. In this way, the agent can provide guidance for its own RL training. Our main contribution is the Self-Explanation for RL from Demonstrations (SERLfD) framework, which can overcome the limitations of traditional RLfD approaches. Our experimental results show that applying the SERLfD framework improves an RLfD model in terms of training stability and performance.