We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large-scale egocentric datasets of interaction-rich, multi-modal data have emerged. However, learning representations from such videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment; yet current successful multi-modal learning frameworks encourage representations that are invariant over time. To address these challenges, we leverage audio signals to identify moments of likely interaction that are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.