Fast Lifelong Adaptive Inverse Reinforcement Learning from Demonstrations

Learning from Demonstration (LfD) approaches empower end-users to teach robots novel tasks via demonstrations of the desired behaviors, democratizing access to robotics. However, current LfD frameworks are capable of neither fast adaptation to heterogeneous human demonstrations nor large-scale deployment in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, allowing for quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed in lifelong deployments, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows sublinearly with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks across three control tasks, with an average 57% improvement in policy returns and an average of 78% fewer episodes required for demonstration modeling using policy mixtures. Finally, we demonstrate the success of FLAIR in a table tennis task and find that users rate FLAIR as having higher task (p<.05) and personalization (p<.05) performance.
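To make the "mixture first, expand only when needed" idea concrete, the following is a minimal illustrative sketch and not FLAIR's actual algorithm. It assumes, purely for illustration, that each prototypical strategy and each new demonstration can be summarized by a fixed-length feature vector, that a mixture is a set of simplex weights over the prototypes, and that a new prototype is added when the best mixture's residual exceeds a hypothetical threshold `tol`. The function names (`fit_mixture`, `adapt_or_expand`) are likewise hypothetical.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def fit_mixture(prototypes, demo, steps=500):
    """Projected gradient descent for simplex weights w that minimize
    ||prototypes.T @ w - demo||^2 (assumed stand-in for mixture fitting)."""
    k = prototypes.shape[0]
    w = np.full(k, 1.0 / k)
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    lipschitz = 2.0 * np.linalg.norm(prototypes, 2) ** 2
    lr = 1.0 / max(lipschitz, 1e-12)
    for _ in range(steps):
        grad = 2.0 * prototypes @ (prototypes.T @ w - demo)
        w = project_to_simplex(w - lr * grad)
    residual = np.linalg.norm(prototypes.T @ w - demo)
    return w, residual


def adapt_or_expand(prototypes, demo, tol=0.1):
    """Reuse existing prototypes when a mixture fits the demonstration;
    otherwise add the demonstration as a new prototype (hypothetical rule)."""
    w, residual = fit_mixture(prototypes, demo)
    if residual <= tol:
        return prototypes, w, False
    expanded = np.vstack([prototypes, demo])
    w_new = np.zeros(expanded.shape[0])
    w_new[-1] = 1.0
    return expanded, w_new, True


if __name__ == "__main__":
    protos = np.array([[1.0, 0.0], [0.0, 1.0]])
    # A demonstration inside the span of existing strategies: no expansion.
    _, weights, grew = adapt_or_expand(protos, np.array([0.3, 0.7]))
    print(weights, grew)
    # A demonstration no mixture can approximate: the model grows by one.
    protos2, _, grew2 = adapt_or_expand(protos, np.array([2.0, 2.0]))
    print(protos2.shape, grew2)
```

Under these assumptions, the model size grows only when a demonstration falls outside what the current prototypes can mix to, which is one simple way the sublinear growth described above could arise.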
