We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused the sound. By grouping these free-form descriptions of audio into classes, we identify actions that can be discriminated purely from audio. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify against visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes, as well as 39.2k non-categorised segments. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models in recognising actions that sound.
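For illustration only, below is a minimal sketch of how temporally-labelled audio segments of this kind could be represented and summarised. The column names, example rows, and class labels here are hypothetical and are not taken from the released annotation format.

```python
import pandas as pd

# Hypothetical rows illustrating temporally-labelled audio segments:
# each segment has start/stop times (seconds) within a video's audio
# stream and a class label describing the action that caused the sound.
segments = pd.DataFrame(
    [
        ("P01_101", 12.40, 13.05, "cut / chop"),
        ("P01_101", 14.80, 15.20, "metal-only collision"),
        ("P02_003", 3.15, 4.90, "water"),
    ],
    columns=["video_id", "start_sec", "stop_sec", "class"],
)

# Per-class segment counts and total audible duration in seconds.
summary = segments.assign(duration=segments.stop_sec - segments.start_sec)
print(summary.groupby("class")["duration"].agg(["count", "sum"]))
```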