This work addresses challenges in developing conversational assistants that support rich multimodal video interactions to accomplish real-world tasks interactively. We introduce the task of automatically linking instructional videos to task steps as "Video Instructions Linking for Complex Tasks" (VILT). Specifically, we focus on the cooking domain, empowering users to cook meals interactively with a video-enabled Alexa skill. We create a reusable benchmark with 61 queries from recipe tasks and curate a collection of 2,133 instructional "How-To" cooking videos. Studying VILT with state-of-the-art retrieval methods, we find that dense retrieval with ANCE is the most effective, achieving an NDCG@3 of 0.566 and a P@1 of 0.644. We also conduct a user study that measures the effect of incorporating videos in a real-world task setting, where 10 participants perform several cooking tasks under varying multimodal experimental conditions using a state-of-the-art Alexa TaskBot system. Users interacting with manually linked videos reported learning something new 64% of the time, a 9-point increase over automatically linked videos (55%), indicating that linked video relevance is important for task learning.
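The evaluation metrics reported above (NDCG@3 and P@1) can be illustrated with a minimal sketch. The relevance labels below are hypothetical, not taken from the VILT benchmark; they simply show how the two scores are computed for one query's ranked result list:

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k retrieved items that are relevant (binary labels)."""
    return sum(rels[:k]) / k

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (descending-sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Hypothetical binary relevance of a system's top-4 ranked videos for one query
ranked = [1, 0, 1, 0]
print(precision_at_k(ranked, 1))          # 1.0
print(round(ndcg_at_k(ranked, 3), 3))     # 0.92
```

In the study, both metrics are averaged over the 61 benchmark queries to produce the reported scores.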