In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (wikiHow) that includes detailed descriptions of the steps needed for the execution of a wide variety of complex activities. Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically-labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification.
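The matching step described above can be illustrated with a minimal sketch. It assumes that each transcribed sentence is assigned to the single most similar step description, which may differ from the actual procedure. The `embed` function is a toy bag-of-words stand-in for the language model, and the function names, step descriptions, and transcript sentences are hypothetical examples that are not taken from wikiHow or from the paper.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a language-model embedding (assumption): word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_steps(asr_sentences, step_descriptions):
    """For each transcribed sentence, return (index of best-matching step, score)."""
    step_vecs = [embed(s) for s in step_descriptions]
    labels = []
    for sentence in asr_sentences:
        vec = embed(sentence)
        scores = [cosine(vec, sv) for sv in step_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append((best, scores[best]))
    return labels


# Hypothetical knowledge-base steps and noisy transcript sentences.
steps = [
    "Crack the eggs into a bowl",
    "Whisk the eggs with milk",
    "Pour the mixture into a hot pan",
]
asr = [
    "so first uh crack two eggs into your bowl",
    "now pour that mixture right into the pan",
]
pseudo_labels = assign_steps(asr, steps)
```

In this sketch the resulting step indices play the role of automatically generated labels for the video segments aligned with each sentence; a real pipeline would replace `embed` with sentence embeddings from a pretrained language model.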