Audio event detection is a widely studied audio processing task, with applications ranging from self-driving cars to healthcare. In-the-wild datasets such as AudioSet have propelled research in this field. However, such efforts typically involve manual annotation and verification, which are expensive to perform at scale. Movies depict a variety of real-life and fictional scenarios, which makes them a rich resource for mining a wide range of audio events. In this work, we present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S). We use publicly available closed-caption transcripts to automatically mine over 110K audio events from 430 movies. We identify three dimensions for categorizing audio events (sound, source, and quality) and present the steps involved in producing a final taxonomy of 245 sounds. We discuss the choices involved in generating the taxonomy and highlight the human-centered nature of the sounds in our dataset. We establish a baseline of 34.76% mean average precision for audio-only sound classification and show that incorporating visual information can further improve performance by about 5%. Data and code are made available for research at https://github.com/usc-sail/mica-subtitle-aligned-movie-sounds
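As a rough illustration of how closed captions can be mined for audio events, the sketch below extracts bracketed sound descriptions (e.g. "[door slams]") from caption lines. This is a hypothetical minimal example, not the actual SAM-S pipeline, which additionally aligns events to movie audio and normalizes them into the 245-sound taxonomy.

```python
import re

# Closed captions commonly mark non-speech sounds in square brackets
# or parentheses, e.g. "[door slams]" or "(phone ringing)".
SOUND_TAG = re.compile(r"[\[\(]([^\]\)]+)[\]\)]")

def mine_sound_tags(caption_lines):
    """Return bracketed sound descriptions found in caption lines."""
    events = []
    for line in caption_lines:
        for match in SOUND_TAG.findall(line):
            tag = match.strip().lower()
            if tag:
                events.append(tag)
    return events

captions = [
    "00:01:04 --> 00:01:06",   # timing line, no sound tag
    "[door slams]",
    "JOHN: Who's there?",      # spoken dialogue, skipped
    "(phone ringing)",
]
print(mine_sound_tags(captions))  # -> ['door slams', 'phone ringing']
```

A real pipeline would also need to filter speaker labels and music cues that some subtitle conventions place in brackets.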