Learning and Verification of Task Structure in Instructional Videos

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Unlike prior work, which learns step representations locally, our approach learns them globally, using video of the entire surrounding task as context. From these learned representations, we can verify whether an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos: one verifies whether a video contains an anomalous step, and the other verifies whether the steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on three existing benchmarks (procedural activity recognition, step classification, and step forecasting) and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
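The masked step modeling objective described above can be illustrated with a toy sketch. This is not the VideoTaskformer implementation: the `MASK` placeholder, the `score_fn` callable standing in for the model, and the use of plain strings in place of clip features are all assumptions made for illustration. The sketch only shows the general shape of the objective: hide one randomly chosen step, score candidate step labels using the remaining steps as context, and penalize the model with a cross-entropy loss on the hidden step's label.

```python
import math
import random

# Hypothetical placeholder standing in for a masked-out step.
MASK = "<mask>"


def mask_random_step(steps, rng):
    """Return a copy of `steps` with one random step masked, plus its index."""
    idx = rng.randrange(len(steps))
    masked = list(steps)  # copy so the caller's sequence is left untouched
    masked[idx] = MASK
    return masked, idx


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def masked_step_loss(steps, labels, score_fn, rng):
    """Loss for predicting the label of one randomly masked step.

    `score_fn(masked_steps, idx)` is an assumed stand-in for the model: it
    sees the whole masked sequence (global context) and returns one logit
    per candidate step label for position `idx`.
    """
    masked, idx = mask_random_step(steps, rng)
    logits = score_fn(masked, idx)
    return cross_entropy(logits, labels[idx])
```

As a sanity check on the sketch, a scorer that returns equal logits for `K` candidate labels yields a loss of `log(K)` regardless of which step is masked; a useful model would need to score lower than that by exploiting the surrounding steps.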
