AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation

We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system that learns to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation: the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety in training and evaluation data, and failure to account for the trade-off between preserving on-screen sounds and suppressing off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a finer resolution over time, and we also propose efficient separable variants that scale to longer videos without sacrificing much performance. We also find that pre-training the separation model on audio only greatly improves results. For training and evaluation, we collected new human annotations of on-screen sounds from a large database of in-the-wild videos (YFCC100M). This new dataset is more diverse and challenging. Finally, we propose a calibration procedure that allows exact tuning of on-screen reconstruction versus off-screen suppression, which greatly simplifies comparing performance between models with different operating points. Overall, our experimental results show marked improvements in on-screen separation performance under much more general conditions than previous methods, with minimal additional computational complexity.
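
To make the calibration idea concrete, here is a minimal sketch of one way such a procedure could work. It is an illustration under stated assumptions, not the procedure from the paper: it assumes the on-screen estimate is a mix of separated sources weighted by per-source on-screen probabilities, that a single scalar `offset` is added to every on-screen logit, and that suppression is measured on one clip as the mixture-to-estimate power ratio in dB. The names `on_screen_estimate`, `suppression_db`, and `calibrate_offset` are hypothetical. Under these assumptions, raising the offset keeps more of every source, so a bisection search can find the offset that reaches a chosen suppression level.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def on_screen_estimate(sources, logits, offset):
    """Mix separated sources, each weighted by its shifted on-screen probability."""
    weights = [sigmoid(l + offset) for l in logits]
    n = len(sources[0])
    return [sum(w * s[t] for w, s in zip(weights, sources)) for t in range(n)]

def power(x):
    return sum(v * v for v in x) / len(x)

def suppression_db(mixture, estimate):
    """Suppression as mixture power over estimate power, in dB (assumed metric)."""
    return 10.0 * math.log10(power(mixture) / max(power(estimate), 1e-12))

def calibrate_offset(sources, logits, target_db, lo=-20.0, hi=20.0, iters=60):
    """Bisection for the logit offset whose suppression equals target_db.

    Assumes suppression decreases monotonically as the offset grows, which
    holds when the separated sources are roughly uncorrelated.
    """
    n = len(sources[0])
    mixture = [sum(s[t] for s in sources) for t in range(n)]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        estimate = on_screen_estimate(sources, logits, mid)
        if suppression_db(mixture, estimate) > target_db:
            lo = mid  # too much suppression: a larger offset is needed
        else:
            hi = mid  # too little suppression: a smaller offset is needed
    return 0.5 * (lo + hi)

# Toy clip: two orthogonal sinusoids standing in for separated sources.
n = 200
sources = [
    [math.sin(2 * math.pi * 3 * t / n) for t in range(n)],
    [0.5 * math.sin(2 * math.pi * 7 * t / n) for t in range(n)],
]
logits = [-1.0, 0.5]  # hypothetical per-source on-screen logits
offset = calibrate_offset(sources, logits, target_db=6.0)
```

Fixing every model to the same suppression level in this way means any remaining difference shows up only in on-screen reconstruction quality, which is the kind of like-for-like comparison the abstract describes. A practical version would calibrate over a whole evaluation set rather than a single clip.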
