Most existing isolated sound event datasets comprise a small number of sound event classes, usually 10 to 15, restricted to a narrow domain such as domestic or urban sound events. In this work, we introduce GISE-51, a dataset spanning 51 isolated sound events drawn from a broad domain of event types. We also release GISE-51-Mixtures, a dataset of 5-second soundscapes with hard-labelled event boundaries, synthesized from the GISE-51 isolated sound events. We conduct baseline sound event recognition (SER) experiments on GISE-51-Mixtures, benchmarking prominent convolutional neural networks. Models trained on the dataset demonstrate strong transfer learning performance on existing audio recognition benchmarks. Together, GISE-51 and GISE-51-Mixtures attempt to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research along with the freedom to adapt the included isolated sound events for domain-specific applications.
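To make the idea of a soundscape with hard-labelled event boundaries concrete, the following is a minimal sketch of how isolated events could be mixed into a 5-second clip while recording each event's onset and offset. It is an illustration under stated assumptions, not the authors' synthesis pipeline: the sample rate, the sine-tone stand-ins for isolated events, the class names `dog_bark` and `siren`, and the helpers `sine_event` and `synthesize_soundscape` are all hypothetical.

```python
import math
import random

SAMPLE_RATE = 8000   # assumed for illustration; not taken from the paper
CLIP_SECONDS = 5.0   # soundscape length stated in the abstract


def sine_event(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    """Stand-in for an isolated sound event: a short sine tone."""
    n = int(seconds * sample_rate)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def synthesize_soundscape(events, rng, sample_rate=SAMPLE_RATE,
                          clip_seconds=CLIP_SECONDS):
    """Mix isolated events into one clip and record hard event boundaries.

    events: list of (label, samples) pairs, each no longer than the clip.
    Returns (mixture, annotations); each annotation is a
    (label, onset_seconds, offset_seconds) tuple.
    """
    total = int(clip_seconds * sample_rate)
    mixture = [0.0] * total
    annotations = []
    for label, samples in events:
        if len(samples) > total:
            raise ValueError(f"event {label!r} is longer than the clip")
        # Pick a random onset so the whole event fits inside the clip.
        onset = rng.randint(0, total - len(samples))
        for i, value in enumerate(samples):
            mixture[onset + i] += value  # overlapping events simply add
        offset = onset + len(samples)
        annotations.append((label, onset / sample_rate, offset / sample_rate))
    # Sort annotations by onset time.
    return mixture, sorted(annotations, key=lambda a: a[1])


# Hypothetical class names; real events would be loaded from audio files.
rng = random.Random(0)
events = [("dog_bark", sine_event(440.0, 1.0)), ("siren", sine_event(880.0, 2.0))]
mixture, annotations = synthesize_soundscape(events, rng)
```

Each annotation gives the exact start and end time of one event, which is the sense in which the boundaries are "hard-labelled"; a model for SER can then be trained on the mixture with those intervals as targets.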