Automatic target sound extraction (TSE) is a machine learning approach that mimics the human auditory capability of attending to a sound source of interest within a mixture of sources. It typically uses a model conditioned on a fixed form of target sound clue, such as a sound class label, which limits the ways in which users can interact with the model to specify the target sounds. To leverage a variable number of clues across modalities available at inference time, including a video, a sound event class, and a text caption, we propose a unified transformer-based TSE model architecture in which a multi-clue attention module integrates all the clues across the modalities. Since there is no off-the-shelf benchmark to evaluate our proposed approach, we build a dataset based on the public corpora Audioset and AudioCaps. Experimental results on seen and unseen target-sound evaluation sets show that our proposed TSE model can effectively handle a varying number of clues, which improves TSE performance and robustness against partially compromised clues.
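A minimal sketch of how such a multi-clue attention module could be realized, assuming a standard cross-attention formulation in PyTorch; the class name, dimensions, and clue-slot layout are illustrative assumptions, not the paper's implementation. Clue embeddings from whichever modalities are available (video, sound event class, text caption) are stacked, and the mixture representation attends over them, with missing clues masked out.

```python
import torch
import torch.nn as nn

class MultiClueAttention(nn.Module):
    """Hypothetical multi-clue attention: the mixture representation attends
    over a variable set of clue embeddings from different modalities."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, mixture_feats, clue_embeddings, clue_mask=None):
        # mixture_feats:   (batch, time, d_model) encoded sound mixture
        # clue_embeddings: (batch, n_clues, d_model) one embedding per clue slot
        # clue_mask:       (batch, n_clues) True where a clue is missing/padded
        fused, _ = self.attn(query=mixture_feats,
                             key=clue_embeddings,
                             value=clue_embeddings,
                             key_padding_mask=clue_mask)
        # Residual connection keeps the mixture representation intact
        # even when the provided clues are weak or partially missing.
        return mixture_feats + fused


# Usage: class-label and caption clues are available, the video clue is absent.
if __name__ == "__main__":
    module = MultiClueAttention()
    mix = torch.randn(1, 100, 256)                 # encoded mixture frames
    clues = torch.randn(1, 3, 256)                 # slots: video / class / caption
    mask = torch.tensor([[True, False, False]])    # first slot (video) masked out
    out = module(mix, clues, mask)
    print(out.shape)  # torch.Size([1, 100, 256])
```

Because the clues enter only through the keys and values of the attention, any subset of them can condition the extraction network without changing the architecture, which is the property the abstract attributes to the multi-clue attention module.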