Considering that acoustic scenes and sound events are closely related to each other, some previous studies have proposed a joint analysis of acoustic scenes and sound events using multitask learning (MTL)-based neural networks. In these conventional methods, a strongly supervised scheme is applied to sound event detection in the MTL models, which requires strong labels of sound events for model training; however, annotating strong event labels is quite time-consuming. In this paper, we therefore propose a method for the joint analysis of acoustic scenes and sound events based on the MTL framework with weak labels of sound events. In particular, the proposed method introduces the multiple-instance learning scheme for weakly supervised training of sound event detection and evaluates four pooling functions, namely, max pooling, average pooling, exponential softmax pooling, and attention pooling. Experimental results obtained using parts of the TUT Acoustic Scenes 2016/2017 and TUT Sound Events 2016/2017 datasets show that the proposed MTL-based method with weak labels outperforms the conventional single-task-based scene classification and event detection models with weak labels in terms of both scene classification and event detection performance.
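To make the pooling step concrete, the following is a minimal sketch (assumed PyTorch implementation, not the authors' code) of the four multiple-instance-learning pooling functions named above. The tensor names, shapes, and the `AttentionPooling` layer sizes are assumptions for illustration: `frame_probs` holds frame-level event probabilities of shape (batch, frames, classes), and each function aggregates them into clip-level probabilities of shape (batch, classes) that can be trained against weak (clip-level) labels.

```python
import torch
import torch.nn as nn


def max_pooling(frame_probs: torch.Tensor) -> torch.Tensor:
    # Clip-level probability = maximum frame-level probability per class.
    return frame_probs.max(dim=1).values


def average_pooling(frame_probs: torch.Tensor) -> torch.Tensor:
    # Clip-level probability = mean of the frame-level probabilities per class.
    return frame_probs.mean(dim=1)


def exponential_softmax_pooling(frame_probs: torch.Tensor) -> torch.Tensor:
    # Frames are weighted by exp(probability), so confident frames dominate
    # the clip-level estimate without discarding the rest as hard max pooling does.
    weights = torch.exp(frame_probs)
    return (frame_probs * weights).sum(dim=1) / weights.sum(dim=1)


class AttentionPooling(nn.Module):
    # Learned per-frame, per-class attention weights (hypothetical layer sizes).
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.attention = nn.Linear(feature_dim, num_classes)

    def forward(self, frame_feats: torch.Tensor, frame_probs: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, feature_dim); frame_probs: (batch, frames, classes).
        # Attention weights are normalized over the frame axis.
        weights = torch.softmax(self.attention(frame_feats), dim=1)
        return (frame_probs * weights).sum(dim=1)
```

Under this weakly supervised setup, the clip-level output of the chosen pooling function would be compared against the weak event labels (e.g., with binary cross-entropy), while the scene classification branch of the MTL model is trained as usual.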