An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, when, how, and where, and who is not. Speaker activity can be inferred automatically from the rich multimodal information present in media content. This is, however, a challenging problem because of the wide variety and contextual variability of media content and the lack of labeled data. In this work, we present a cross-modal neural network for learning visual representations that carry implicit information about the spatial location of a speaker in the visual frames. To avoid the need for manual active-speaker annotations in visual frames, which are very expensive to acquire, we present a weakly supervised system for localizing active speakers in movie content. We use the learned cross-modal visual representations and provide weak supervision from movie subtitles acting as a proxy for voice activity, thus requiring no manual annotations. We evaluate the proposed system on the AVA active speaker dataset and demonstrate the effectiveness of the cross-modal embeddings for localizing active speakers in comparison with fully supervised systems. We also demonstrate state-of-the-art performance for voice activity detection in an audio-visual framework, especially when speech is accompanied by noise and music.
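To illustrate the kind of weak supervision the abstract describes, the sketch below shows one plausible way to turn subtitle timestamps into frame-level voice-activity labels. This is a minimal sketch under our own assumptions, not the paper's implementation; the function name, timing format, and frame-rate handling are hypothetical.

```python
# Illustrative sketch (assumption, not from the paper): derive per-frame
# voice-activity labels from subtitle cue timings, which act as a proxy
# for ground-truth voice activity.

from typing import List, Tuple


def subtitles_to_vad_labels(
    subtitle_spans: List[Tuple[float, float]],  # (start_sec, end_sec) per subtitle cue
    num_frames: int,
    fps: float,
) -> List[int]:
    """Mark a frame as speech (1) if its timestamp falls inside any subtitle span."""
    labels = [0] * num_frames
    for start, end in subtitle_spans:
        first = max(0, int(start * fps))
        last = min(num_frames - 1, int(end * fps))
        for i in range(first, last + 1):
            labels[i] = 1
    return labels


# Example: two subtitle cues in a 10-second clip at 25 fps.
weak_labels = subtitles_to_vad_labels([(1.2, 3.0), (5.5, 7.8)], num_frames=250, fps=25.0)
```

Such labels indicate only when speech occurs, not where the speaker is in the frame, which is consistent with the weakly supervised setting the abstract describes: spatial localization must be learned implicitly from the cross-modal visual representations.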