Learning from Demonstration (LfD) approaches empower end-users to teach robots novel tasks via demonstrations of the desired behaviors, democratizing access to robotics. However, current LfD frameworks are not capable of fast adaptation to heterogeneous human demonstrations nor of large-scale deployment in ubiquitous robotics applications. In this paper, we propose a novel LfD framework, Fast Lifelong Adaptive Inverse Reinforcement Learning (FLAIR). Our approach (1) leverages learned strategies to construct policy mixtures for fast adaptation to new demonstrations, allowing for quick end-user personalization; (2) distills common knowledge across demonstrations, achieving accurate task inference; and (3) expands its model only when needed in lifelong deployments, maintaining a concise set of prototypical strategies that can approximate all behaviors via policy mixtures. We empirically validate that FLAIR achieves adaptability (i.e., the robot adapts to heterogeneous, user-specific task preferences), efficiency (i.e., the robot achieves sample-efficient adaptation), and scalability (i.e., the model grows sublinearly with the number of demonstrations while maintaining high performance). FLAIR surpasses benchmarks across three continuous control tasks, with an average 57% improvement in policy returns and an average 78% fewer episodes required for demonstration modeling using policy mixtures. Finally, we demonstrate the success of FLAIR in a real-robot table tennis task.
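To make the policy-mixture idea concrete, the following is a minimal, hypothetical sketch, not the paper's implementation: learned prototype policies are stood in for by linear state-to-action maps, and a new demonstration is modeled by fitting simplex-constrained mixture weights via projected gradient descent on action error. All names (`prototype_policies`, `fit_mixture_weights`) and the linear-policy assumption are illustrative only.

```python
# Illustrative sketch of modeling a demonstration as a convex combination
# of prototype policies. This is NOT the FLAIR implementation; prototype
# policies here are hypothetical linear maps, not IRL-learned strategies.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, n_prototypes = 4, 2, 3

# Stand-ins for previously learned prototype policies: linear state -> action maps.
prototype_policies = [rng.normal(size=(action_dim, state_dim))
                      for _ in range(n_prototypes)]

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def fit_mixture_weights(demo_states, demo_actions, steps=500, lr=0.05):
    """Fit simplex-constrained mixture weights to a demonstration by
    projected gradient descent on squared action-prediction error."""
    w = np.full(n_prototypes, 1.0 / n_prototypes)
    # Each prototype's actions on the demo states: (n_prototypes, T, action_dim).
    proto_actions = np.stack([demo_states @ K.T for K in prototype_policies])
    for _ in range(steps):
        pred = np.einsum("p,pta->ta", w, proto_actions)          # mixture actions
        grad = np.einsum("pta,ta->p", proto_actions, pred - demo_actions)
        w = project_to_simplex(w - lr * grad / len(demo_states))  # keep w on simplex
    return w

# Synthetic "demonstration" generated by a hidden mixture of the prototypes.
true_w = np.array([0.7, 0.3, 0.0])
demo_states = rng.normal(size=(50, state_dim))
demo_actions = np.einsum(
    "p,pta->ta", true_w,
    np.stack([demo_states @ K.T for K in prototype_policies]))

print("recovered weights:", np.round(fit_mixture_weights(demo_states, demo_actions), 3))
```

Because only the low-dimensional weight vector is optimized while the prototype policies stay frozen, fitting a new demonstration is cheap; this is one way to read the abstract's claim of sample-efficient adaptation via policy mixtures.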