We introduce a state-of-the-art audio-visual on-screen sound separation system that learns to separate sounds and associate them with on-screen objects by watching in-the-wild videos. We identify limitations of previous work on audio-visual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention and the poor convergence of the audio separation model. Our proposed model addresses these issues with cross-modal and self-attention modules that capture audio-visual dependencies at a finer temporal resolution, and with unsupervised pre-training of the audio separation model. These improvements allow the model to generalize to a much wider set of unseen videos. For evaluation and semi-supervised training, we collected human annotations of on-screen audio from a large database of in-the-wild videos (YFCC100M). Our results show marked improvements in on-screen separation performance under more general conditions than previous methods.
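To make the cross-modal attention idea concrete, the sketch below shows one way audio embeddings could attend over per-frame visual embeddings. This is an illustrative assumption, not the authors' implementation: the embedding dimension, number of heads, sequence lengths, and the use of PyTorch's nn.MultiheadAttention are all hypothetical choices.

```python
# Minimal sketch of cross-modal attention between audio and visual embeddings.
# All shapes, dimensions, and module names are illustrative assumptions,
# not the architecture described in the paper.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Audio-frame embeddings attend over per-frame visual embeddings."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, T_audio, dim) embeddings of audio frames
        # visual: (batch, T_video, dim) embeddings of video frames
        attended, _ = self.attn(query=audio, key=visual, value=visual)
        # Residual connection followed by layer normalization.
        return self.norm(audio + attended)


# Example usage with hypothetical sequence lengths.
audio = torch.randn(2, 100, 256)   # 100 audio frames per clip
visual = torch.randn(2, 16, 256)   # 16 video frames per clip
fused = CrossModalAttention()(audio, visual)
print(fused.shape)  # torch.Size([2, 100, 256])
```

Attending from every audio frame to every video frame, rather than pooling either modality first, is one way to capture audio-visual dependencies at a finer temporal resolution, which is the property the abstract highlights.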