Current audio-visual separation methods share a standard architecture in which an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design entangles the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We re-formulate the visual-sound separation task and propose Instrument as Query (iQuery), with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference in the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that iQuery improves audio-visual sound source separation performance.
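The query-expansion idea in the abstract can be illustrated with a toy sketch: each instrument is represented by a learnable query vector, cross-attention between queries and audio features yields one separation mask per query, and generalizing to a new class appends a single new query row while the attention projections stay frozen. This is a hypothetical NumPy simplification for intuition only, not the paper's actual transformer architecture; all names (`QuerySeparator`, `add_instrument`, `Wk`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 16  # embedding dim, number of time-frequency tokens


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class QuerySeparator:
    """Toy query-based separator (hypothetical simplification):
    one query per instrument; cross-attention over audio features
    produces one soft mask per query."""

    def __init__(self, n_queries):
        self.queries = rng.normal(size=(n_queries, D))  # per-instrument audio queries
        self.Wk = rng.normal(size=(D, D))               # attention projection, kept frozen

    def masks(self, audio_feats):
        # audio_feats: (T, D); returns (n_queries, T) attention masks
        keys = audio_feats @ self.Wk
        return softmax(self.queries @ keys.T / np.sqrt(D))

    def add_instrument(self):
        """Query expansion: append one new query row; the attention
        projection Wk stays frozen, so only the new row would be
        trained for the new instrument class."""
        self.queries = np.vstack([self.queries, rng.normal(size=(1, D))])


sep = QuerySeparator(n_queries=4)
feats = rng.normal(size=(T, D))
m_before = sep.masks(feats)   # (4, 16)
sep.add_instrument()
m_after = sep.masks(feats)    # (5, 16)
```

Because each query row attends independently and the projection is frozen, the masks for the original four instruments are unchanged after expansion, mirroring the disentanglement property the abstract claims.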