Target sound detection (TSD) aims to detect a target sound in a mixture audio given reference information. Previous methods use a conditional network to extract a sound-discriminative embedding from the reference audio, and then use it to detect the target sound in the mixture audio. However, the network performs very differently depending on the reference audio (e.g., it performs poorly for noisy or short reference clips), and tends to make wrong decisions for transient events (i.e., shorter than $1$ second). To overcome these problems, in this paper, we present a reference-aware and duration-robust network (RaDur) for TSD. More specifically, to make the network more aware of the reference information, we propose an embedding enhancement module that takes the mixture audio into account while generating the embedding, and apply attention pooling to strengthen the features of target-sound-related frames and weaken those of noisy frames. In addition, a duration-robust focal loss is proposed to help model events of different durations. To evaluate our method, we build two TSD datasets based on UrbanSound and AudioSet. Extensive experiments show the effectiveness of our method.
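The attention pooling mentioned above can be illustrated with a minimal sketch: frame-level features are weighted by a learned relevance score and summed into a single embedding, so frames related to the target sound contribute more than noisy frames. This is a generic illustration of attention pooling, not the paper's exact architecture; the projection vector `w` and the feature shapes are hypothetical.

```python
import numpy as np

def attention_pooling(frames, w):
    """Pool frame-level features (T, D) into one clip-level embedding (D,).

    `w` (D,) is a hypothetical learned projection that scores how
    relevant each frame is to the target sound; a softmax over time
    up-weights relevant frames and down-weights noisy ones.
    """
    scores = frames @ w                      # (T,) unnormalized relevance
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha /= alpha.sum()                     # attention weights over time
    return alpha @ frames                    # (D,) weighted sum of frames

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))        # 10 frames, 4-dim features
w = rng.standard_normal(4)
emb = attention_pooling(frames, w)
print(emb.shape)                             # clip-level embedding: (4,)
```

In a trained system, `w` (or a small attention network in its place) would be learned jointly with the detector, so the pooled embedding emphasizes the frames most discriminative for the target sound.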