People often watch videos on the web to learn how to cook new recipes, assemble furniture, or repair a computer. We wish to endow robots with the same capability. This is challenging: manipulation actions vary widely, and some videos even involve multiple people who collaborate by sharing and exchanging objects and tools. Furthermore, the learned representations need to be general enough to transfer to robotic systems. On the other hand, previous work has shown that the space of human manipulation actions has a linguistic, hierarchical structure that relates actions to manipulated objects and tools. Building upon this theory of language for action, we propose a system for understanding and executing demonstrated action sequences from full-length, real-world cooking videos on the web. The system takes as input a new, previously unseen cooking video annotated with object labels and bounding boxes, and outputs a collaborative manipulation action plan for one or more robotic arms. We demonstrate the performance of the system on a standardized dataset of 100 YouTube cooking videos, as well as on six full-length YouTube videos that include collaborative actions between two participants. We compare our system against a state-of-the-art action detection baseline and show that it achieves higher action detection accuracy. We additionally propose an open-source platform for executing the learned plans both in a simulation environment and on an actual robotic arm.
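To make the system's output concrete: a collaborative manipulation action plan can be pictured as an ordered sequence of steps, each binding an arm to an action, an object, and optionally a tool, in line with the action–object–tool grammar referenced above. The Python sketch below is a hypothetical rendering of such a plan for illustration only; the class and field names are assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionStep:
    """One step of a learned manipulation plan (hypothetical schema)."""
    arm: str                   # which robotic arm executes the step, e.g. "left"
    action: str                # action label, e.g. "grasp", "cut", "pour"
    obj: str                   # manipulated object, from the video's object labels
    tool: Optional[str] = None # optional tool involved in the action

# A toy two-arm plan in the spirit of a collaborative cooking video:
plan = [
    ActionStep(arm="left",  action="grasp", obj="knife"),
    ActionStep(arm="right", action="grasp", obj="tomato"),
    ActionStep(arm="left",  action="cut",   obj="tomato", tool="knife"),
]

for step in plan:
    args = step.obj if step.tool is None else f"{step.obj}, {step.tool}"
    print(f"{step.arm}: {step.action}({args})")
```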