Sound source localization is a typical yet challenging task that aims to predict the locations of sound sources in a video. Previous single-source methods mainly use the audio-visual association as a clue to localize the sounding object in each image. Because multiple sound sources are mixed together in the original audio space, few multi-source approaches localize multiple sources simultaneously; one recent exception applies a contrastive random walk on a graph whose nodes are images and separated sounds. Despite their promising performance, these methods can only handle a fixed number of sources, and they cannot learn compact class-aware representations for individual sources. To address this shortcoming, we propose a novel audio-visual grouping network, AVGN, that directly learns category-wise semantic features for each source from the input audio mixture and image, localizing multiple sources simultaneously. Specifically, AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features; the aggregated semantic feature for each source then serves as guidance to localize the corresponding visual regions. Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources. We conduct extensive experiments on the MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks. The results demonstrate that the proposed AVGN achieves state-of-the-art sounding-object localization performance in both single-source and multi-source scenarios. Code is available at \url{https://github.com/stoneMo/AVGN}.
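The grouping idea described above can be illustrated with a minimal sketch: learnable class tokens attend over a feature sequence to aggregate class-aware source features, and each aggregated feature is matched against a spatial visual map to localize its source. This is an assumption-laden simplification, not the actual AVGN implementation; the module and function names (`ClassTokenGrouping`, `localize`) and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassTokenGrouping(nn.Module):
    """Sketch: learnable class tokens aggregate class-aware source
    features from an input feature sequence via cross-attention.
    (Hypothetical module, not the paper's exact architecture.)"""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # one learnable token per semantic category
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) audio or visual feature sequence
        b = feats.size(0)
        tokens = self.class_tokens.unsqueeze(0).expand(b, -1, -1)
        # each class token attends over the feature sequence,
        # pooling evidence for its category into one vector
        out, _ = self.attn(tokens, feats, feats)
        return out  # (B, num_classes, dim) class-aware source features


def localize(source_feat: torch.Tensor, vis_map: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between one source feature and spatial visual
    features yields a localization heatmap over image regions."""
    # source_feat: (B, dim); vis_map: (B, dim, H, W)
    s = F.normalize(source_feat, dim=-1)
    v = F.normalize(vis_map, dim=1)
    return torch.einsum('bd,bdhw->bhw', s, v)  # (B, H, W)
```

Because the number of active tokens can vary per input (e.g. by keeping only tokens whose classes are predicted present), this style of grouping naturally handles a flexible number of sources rather than a fixed one.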