In Psychology, actions are paramount for humans to identify sound events. In Machine Learning (ML), action recognition achieves high accuracy; however, whether identifying actions can benefit Sound Event Classification (SEC), as opposed to mapping audio directly to a sound event, has not been explored. We therefore propose a new Psychology-inspired approach to SEC that includes the identification of actions by human listeners. To this end, we used crowdsourcing to have listeners identify 20 actions that, in isolation or in combination, may have produced any of the 50 sound events in the well-studied ESC-50 dataset. The resulting annotations for each audio recording relate actions to a database of sound events for the first time. We used the annotations to create semantic representations called Action Vectors (AVs). We evaluated SEC by comparing the AVs with two types of audio features -- log-mel spectrograms and state-of-the-art audio embeddings. Because audio features and AVs capture different abstractions of the acoustic content, we combined them and achieved one of the highest reported accuracies (88%).
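The combination of AVs with audio features described above can be sketched as a simple feature-level fusion. This is a minimal illustrative sketch, not the paper's exact pipeline: the synthetic data, the z-scoring step, and the nearest-centroid classifier are all assumptions; only the dimensions (20 crowdsourced actions, a fixed-size audio embedding) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, n_per_class = 4, 10
emb_dim, av_dim = 128, 20  # audio embedding size (assumed) and 20 actions

# Synthetic per-class data standing in for audio embeddings and AVs.
X_emb = np.concatenate(
    [rng.normal(c, 1.0, (n_per_class, emb_dim)) for c in range(n_classes)]
)
X_av = np.concatenate(
    [rng.normal(c, 1.0, (n_per_class, av_dim)) for c in range(n_classes)]
)
y = np.repeat(np.arange(n_classes), n_per_class)

def zscore(X):
    # Standardize each modality so neither dominates by scale.
    return (X - X.mean(0)) / (X.std(0) + 1e-8)

# Early fusion: concatenate the two representations per recording.
X = np.hstack([zscore(X_emb), zscore(X_av)])

# Illustrative nearest-centroid classifier on the fused features.
centroids = np.stack([X[y == c].mean(0) for c in range(n_classes)])
pred = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
accuracy = (pred == y).mean()
```

The key point is only the `np.hstack` step: because the two inputs capture different abstractions of the same recording, concatenating them gives the classifier access to both.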