Narrated "how-to" videos have emerged as a promising data source for a wide range of learning problems, from learning visual representations to training robot policies. However, this data is extremely noisy, as the narrations do not always describe the actions demonstrated in the video. To address this problem, we introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video. We propose "What You Say is What You Show" (WYS^2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data. We further generalize our approach to operate on audio input alone, learning properties of the narrator's voice that hint at whether they are currently doing what they describe. Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact on state-of-the-art summarization and alignment of instructional video.