Learning To Recognize Procedural Activities with Distant Supervision

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos that span only a few seconds and that are manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (wikiHow) that includes detailed descriptions of the steps needed for the execution of a wide variety of complex activities. Our method uses a language model to match noisy, automatically-transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically-labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification.
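The matching step described in the abstract (aligning transcribed speech with step descriptions from a knowledge base) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that a language model is used, so a toy bag-of-words cosine similarity is swapped in here to keep the example self-contained. The function names (`embed`, `cosine`, `assign_steps`), the step descriptions, and the speech sentences are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a language-model embedding: a bag-of-words count vector.

    Assumption: the paper uses a real language model; this is only illustrative.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_steps(asr_sentences, step_descriptions):
    """For each transcribed sentence, return (index of most similar step, score)."""
    step_vecs = [embed(s) for s in step_descriptions]
    labels = []
    for sentence in asr_sentences:
        vec = embed(sentence)
        scores = [cosine(vec, sv) for sv in step_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append((best, scores[best]))
    return labels


# Hypothetical knowledge-base steps and noisy transcribed speech.
steps = [
    "Crack the eggs into a bowl",
    "Whisk the eggs with salt",
    "Pour the mixture into a hot pan",
]
speech = [
    "so first we crack two eggs into this bowl",
    "now pour it all into the pan",
]

pseudo_labels = assign_steps(speech, steps)
for sentence, (step_idx, score) in zip(speech, pseudo_labels):
    print(f"{sentence!r} -> step {step_idx} ({steps[step_idx]!r}), score {score:.2f}")
```

In a setup like the one the abstract describes, the step indices produced this way would serve as automatically generated labels for the video segments in which the speech occurs, so that a video model can be trained to predict them without manual annotation.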

