Humans excel at learning long-horizon tasks from demonstrations augmented with textual commentary, as evidenced by the burgeoning popularity of tutorial videos online. Intuitively, this capability can be separated into two distinct subtasks: first, dividing a long-horizon demonstration sequence into semantically meaningful events; second, adapting such events into meaningful behaviors in one's own environment. Here, we present Video2Skill (V2S), which attempts to extend this capability to artificial agents by allowing a robot arm to learn from human cooking videos. We first use sequence-to-sequence auto-encoder-style architectures to learn a temporal latent space for events in long-horizon demonstrations. We then transfer these representations to the robotic target domain, using a small amount of offline and unrelated interaction data (sequences of state-action pairs of the robot arm controlled by an expert) to adapt these events into actionable representations, i.e., skills. Through experiments, we demonstrate that our approach results in self-supervised analogy learning, where the agent learns to draw analogies between motions in human demonstration data and behaviors in the robotic environment. We also demonstrate the efficacy of our approach for model learning, showing how Video2Skill utilizes prior knowledge from human demonstrations to outperform traditional model learning of long-horizon dynamics. Finally, we demonstrate the utility of our approach for non-tabula rasa decision-making, i.e., utilizing video demonstrations for zero-shot skill generation.
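To illustrate the first step, the sketch below shows one way a sequence-to-sequence auto-encoder could embed a demonstration segment into a single temporal latent and reconstruct the segment from it. This is a minimal illustrative example in PyTorch, not the paper's actual architecture; the module name `Seq2SeqEventAutoEncoder` and the dimensions `feat_dim` and `latent_dim` are assumptions introduced here.

```python
import torch
import torch.nn as nn

class Seq2SeqEventAutoEncoder(nn.Module):
    """Sketch: encode a segment of per-frame features into one temporal
    latent (an "event" embedding), then reconstruct the segment from it."""
    def __init__(self, feat_dim=512, latent_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.readout = nn.Linear(latent_dim, feat_dim)

    def forward(self, x):                       # x: (batch, T, feat_dim)
        _, h = self.encoder(x)                  # h: (1, batch, latent_dim)
        z = h[-1]                               # event-level temporal latent
        # feed the latent at every timestep to decode the sequence back
        dec_in = z.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.decoder(dec_in)
        return self.readout(out), z             # reconstruction and latent

# usage: reconstruct a batch of 4 segments, 16 timesteps of 512-d features
model = Seq2SeqEventAutoEncoder()
x = torch.randn(4, 16, 512)
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)
```

The learned latents `z` would then be the event representations that are later adapted, using offline robot interaction data, into actionable skills.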