Existing work on audio-visual event localization (AVE) handles manually trimmed videos, each containing only a single event instance. However, this setting is unrealistic, as natural videos often contain numerous audio-visual events of different categories. To better match real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging, as it requires fine-grained audio-visual scene and context understanding. To tackle it, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and may co-occur, as in real-life scenes. We then formulate the task using a new learning-based framework that fully integrates the audio and visual modalities to localize audio-visual events of various lengths and to capture dependencies between them in a single pass. Extensive experiments demonstrate the effectiveness of our method, as well as the significance of multi-scale cross-modal perception and dependency modeling for this task.
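The abstract does not detail the framework's architecture. As a rough illustrative sketch only — assuming time-aligned snippet-level features, bidirectional cross-modal attention for fusion, and an anchor-free temporal feature pyramid for covering events of various lengths, none of which is spelled out above — one plausible shape in PyTorch could look like the following. All class names (`CrossModalBlock`, `DenseAVELocalizer`) and hyperparameters are hypothetical, not the paper's actual design.

```python
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """Bidirectional cross-attention: each modality attends to the other.
    Hypothetical fusion module, not the paper's confirmed architecture."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # v, a: (batch, time, dim) snippet-level visual / audio features
        v = self.norm_v(v + self.a2v(v, a, a, need_weights=False)[0])
        a = self.norm_a(a + self.v2a(a, v, v, need_weights=False)[0])
        return v, a


class DenseAVELocalizer(nn.Module):
    """Fuse modalities, then predict on a temporal feature pyramid so that
    events of very different durations are handled in one forward pass."""

    def __init__(self, dim: int = 256, num_classes: int = 100, num_scales: int = 4):
        super().__init__()
        self.fusion = CrossModalBlock(dim)
        # Strided convs halve the temporal resolution at each pyramid level;
        # coarser levels cover longer events.
        self.downsample = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_scales - 1)]
        )
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=1)  # per-moment class scores
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=1)            # distances to event start/end

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        v, a = self.fusion(v, a)
        x = (v + a).transpose(1, 2)  # (batch, dim, time) for Conv1d
        outputs = []
        for i in range(len(self.downsample) + 1):
            outputs.append((self.cls_head(x), self.reg_head(x)))
            if i < len(self.downsample):
                x = self.downsample[i](x)
        return outputs  # one (scores, offsets) pair per temporal scale


if __name__ == "__main__":
    model = DenseAVELocalizer()
    v = torch.randn(2, 64, 256)  # 2 videos, 64 visual snippets, 256-d features
    a = torch.randn(2, 64, 256)  # time-aligned audio snippet features
    for scores, offsets in model(v, a):
        print(scores.shape, offsets.shape)  # (2, 100, T) and (2, 2, T) per scale
```

The anchor-free design here (per-moment classification plus start/end offset regression at every pyramid level) is one common way to realize "events with various lengths in a single pass": short events are resolved at the fine temporal scales while long events fall within the receptive field of the coarse ones.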