The challenges of polyphonic sound event detection (PSED) stem from the need to detect multiple overlapping events in a time series. Recent efforts exploit Deep Neural Networks (DNNs) on Time-Frequency Representations (TFRs) of audio clips as model inputs to mitigate such issues. However, existing solutions often rely on a single type of TFR, which leads to under-utilization of input features. To this end, we propose a novel PSED framework that incorporates Multi-Type-Multi-Scale TFRs. Our key insight is that TFRs of different types or at different scales reveal acoustic patterns in a complementary manner, so that overlapping events can be best extracted by combining different TFRs. Moreover, our framework design applies a novel approach that adaptively fuses different models and TFRs symbiotically, significantly improving overall performance. We quantitatively examine the benefits of our framework using Capsule Neural Networks, a state-of-the-art approach for PSED. The experimental results show that our method achieves a 7\% reduction in error rate compared with state-of-the-art solutions on the TUT-SED 2016 dataset.
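The "multi-scale" side of the idea can be illustrated with a minimal sketch (not the paper's implementation): computing STFT magnitude spectrograms of the same clip at two window lengths. A short window gives fine time resolution, a long window gives fine frequency resolution, and a fusion model would consume both representations. All parameter values below are illustrative assumptions.

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude spectrogram via a Hann-windowed STFT (one 'scale')."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft over each frame -> shape (n_frames, win_len // 2 + 1)
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy signal: two simultaneous tones, mimicking overlapping events.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Two scales of the same clip: short window (good time resolution)
# vs. long window (good frequency resolution).
tfr_short = stft_mag(x, win_len=256, hop=128)    # (124, 129)
tfr_long = stft_mag(x, win_len=2048, hop=512)    # (28, 1025)
```

Multi-type TFRs (e.g. mel spectrograms vs. constant-Q transforms) extend the same principle across representation types rather than window scales.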