In this paper, we introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions. Compared with using text, using audio requires the model to directly exploit the phonemes and syllables related to the video from raw speech. Moreover, we randomly add environmental noise to the speech audio, which further increases the difficulty of the task and better simulates real applications. To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) approach for the audio pre-training process, which uses visual perception to help understand the spoken language and suppress the external noise. Because the model cannot access ground-truth video segments during inference, we design a curriculum strategy that gradually shifts the input video from the ground-truth segment to the entire video content during pre-training. In this way, the model learns to extract critical visual information from the entire video clip to help understand the spoken language. In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet, named the ActivityNet Speech dataset. Extensive experiments demonstrate that the proposed video-guided curriculum learning facilitates the pre-training process to obtain a mutual audio encoder, significantly improving performance on the spoken video grounding task. Moreover, we show that with noisy audio, our model outperforms the method that grounds video using ASR transcripts, further demonstrating the effectiveness of our curriculum strategy.
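As an illustration of the curriculum idea only, the shift from the ground-truth segment to the entire video can be sketched as a schedule over the video span fed to the model during pre-training. The abstract does not specify the schedule, so the linear interpolation below, the function name `curriculum_window`, and all of its parameters are assumptions made for this sketch, not the method used in the paper.

```python
def curriculum_window(gt_start, gt_end, video_len, step, total_steps):
    """Return the (start, end) video span, in seconds, used at a training step.

    Hypothetical linear schedule (an assumption, not the paper's method):
    at step 0 the span equals the ground-truth segment, and at total_steps
    it covers the whole video [0, video_len].
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Training progress in [0, 1]; steps past the schedule use the full video.
    p = min(max(step / total_steps, 0.0), 1.0)
    # Move the left boundary toward 0 and the right boundary toward video_len.
    start = gt_start * (1.0 - p)
    end = gt_end + (video_len - gt_end) * p
    return start, end


if __name__ == "__main__":
    # Ground-truth segment 10 s to 20 s inside a 60 s video, 100-step schedule.
    for step in (0, 50, 100):
        print(step, curriculum_window(10.0, 20.0, 60.0, step, 100))
```

Under this assumed schedule, early pre-training steps see only the video content that matches the spoken description, and later steps see progressively more unrelated context, ending with the full clip that is available at inference time.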