Target sound extraction (TSE) aims to extract the sound of a target sound event class from an audio mixture containing multiple sound events. Previous works mainly focus on the problems of weakly-labelled data, joint learning, and new classes; however, none of them considers the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis. In this paper, we study how to utilize such timestamp information to help extract the target sound via a target sound detection network and a target-weighted time-frequency loss function. More specifically, we use the detection result of a target sound detection (TSD) network as additional information to guide the learning of the target sound extraction network. We also find that the result of TSE can further improve the performance of the TSD network, so a mutual learning framework of target sound detection and extraction is proposed. In addition, a target-weighted time-frequency loss function is designed to pay more attention to the temporal regions of the target sound during training. Experimental results on synthesized data generated from the Freesound Datasets show that our proposed method can significantly improve the performance of TSE.
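To make the target-weighted time-frequency loss concrete, the sketch below shows one plausible realization: time-frequency bins falling inside the target event's temporal region (given by timestamps or TSD predictions) receive a larger weight than bins outside it. This is not the paper's exact formulation; the function name, the L1 distance on magnitude spectrograms, and the weighting factor `alpha` are illustrative assumptions.

```python
import torch


def target_weighted_tf_loss(est_spec, ref_spec, frame_mask, alpha=2.0):
    """Hypothetical target-weighted time-frequency loss.

    est_spec, ref_spec: (batch, freq, time) magnitude spectrograms of the
        extracted and reference target signals.
    frame_mask: (batch, time) binary mask, 1 for frames where the target
        event is active (e.g. derived from onset/offset timestamps or from
        the TSD network's output), 0 elsewhere.
    alpha: assumed weight applied to active frames (alpha > 1 emphasizes
        the target's temporal regions during training).
    """
    per_bin_err = torch.abs(est_spec - ref_spec)               # L1 error per T-F bin
    weights = 1.0 + (alpha - 1.0) * frame_mask.unsqueeze(1)    # (batch, 1, time)
    return (weights * per_bin_err).mean()
```

Under this reading, setting `alpha = 1.0` recovers a plain L1 spectrogram loss, while larger values bias training toward reconstructing the frames in which the target sound actually occurs.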