Human beings can pick out a target sound type from a multi-source mixture signal through selective auditory attention; however, this capability has rarely been explored in machine hearing. This paper addresses the target sound detection (TSD) task, which aims to detect the target sound signal in a mixture audio when a reference audio of the target sound is given. We present a novel target sound detection network (TSDNet) consisting of two main parts: a conditional network, which generates a sound-discriminative conditional embedding vector representing the target sound, and a detection network, which takes both the mixture audio and the conditional embedding vector as inputs and produces the detection result for the target sound. The two networks can be jointly optimized with a multi-task learning approach to further improve performance. In addition, we study both strongly-supervised and weakly-supervised strategies for training TSDNet and propose a data augmentation method that mixes two samples. To facilitate this research, we build a target sound detection dataset (\textit{i.e.} URBAN-TSD) based on the URBAN-SED and UrbanSound8K datasets. Experimental results indicate that our method achieves segment-based F scores of 76.3$\%$ and 56.8$\%$ on the strongly-labelled and weakly-labelled data, respectively.
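The segment-based F score reported above can be illustrated with a minimal sketch for a single target sound. This is not the authors' evaluation code: the function names `to_segments` and `segment_f_score`, the 1-second segment length, and the example timestamps are all assumptions chosen for illustration. The idea is to cut the timeline into fixed-length segments, mark a segment as active when an annotated or predicted event overlaps it, and compute F from the per-segment true positives, false positives, and false negatives.

```python
import math


def to_segments(events, duration, seg_len=1.0):
    """Mark each fixed-length segment as active if any (onset, offset) event overlaps it.

    `seg_len` defaults to 1 second, which is an assumption made for this sketch.
    """
    n_segments = math.ceil(duration / seg_len)
    active = [False] * n_segments
    for onset, offset in events:
        first = max(int(math.floor(onset / seg_len)), 0)
        last = min(int(math.ceil(offset / seg_len)), n_segments)
        for i in range(first, last):
            active[i] = True
    return active


def segment_f_score(reference, predicted, duration, seg_len=1.0):
    """Segment-based F score for one target sound (hypothetical helper).

    `reference` and `predicted` are lists of (onset, offset) pairs in seconds.
    Returns 0.0 when there are no active segments at all, where F is undefined.
    """
    ref = to_segments(reference, duration, seg_len)
    est = to_segments(predicted, duration, seg_len)
    tp = sum(1 for r, e in zip(ref, est) if r and e)
    fp = sum(1 for r, e in zip(ref, est) if e and not r)
    fn = sum(1 for r, e in zip(ref, est) if r and not e)
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0
    return 2 * tp / denominator


# Made-up example: a 10 s clip with one annotated and one predicted event.
reference_events = [(1.0, 3.5)]   # active in segments 1, 2, 3
predicted_events = [(2.0, 5.0)]   # active in segments 2, 3, 4
score = segment_f_score(reference_events, predicted_events, duration=10.0)
print(round(score, 3))  # 0.667
```

In this example two segments are correctly detected, one is a false alarm, and one is missed, so F = 2·2 / (2·2 + 1 + 1) = 2/3. A shorter segment length penalizes boundary errors more strictly, so the chosen length should be stated alongside any reported score.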