Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, which is based on over 2M tracks from YouTube videos and encompasses over 500 sound classes. However, AudioSet is not an open dataset, as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic because YouTube videos gradually disappear and because of usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio, manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with a discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight into the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.