A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that, through selective repetition, attends over the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments to replay based on slot attention. At each playback, the selected segments are replayed using a smaller hop length, which yields higher-resolution features within these segments. We show that our method consistently achieves state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
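The playback loop described above can be illustrated with a minimal sketch. It is not the proposed model. A softmax over per-frame log-energy stands in for the learned slot attention, and the function names (`frame_energy`, `replay`) and parameters (`win`, `hop`, `n_playbacks`, `keep`) are hypothetical choices made only for illustration. The sketch shows just the control flow: start from the full sequence, keep the segment with the highest attention mass, and re-extract features from it with half the hop length.

```python
import numpy as np

def frame_energy(signal, win, hop):
    """Log-energy per frame; a stand-in for learned spectrogram features."""
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
    return np.log1p((frames ** 2).sum(axis=1))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def replay(signal, win=400, hop=160, n_playbacks=3, keep=0.5):
    """Return (start, end, hop, n_frames) for each playback.

    Assumption: a single contiguous segment is kept per playback, chosen as
    the window of frames with the largest total attention weight.
    """
    start, end = 0, len(signal)
    history = []
    for _ in range(n_playbacks):
        feats = frame_energy(signal[start:end], win, hop)
        attn = softmax(feats)  # placeholder for slot-attention weights
        history.append((start, end, hop, len(feats)))

        # Slide a window covering `keep` of the frames and take the best one.
        span = max(1, int(len(feats) * keep))
        mass = np.convolve(attn, np.ones(span), mode="valid")
        first = int(mass.argmax())

        # Map the chosen frames back to sample indices, then halve the hop
        # so the next playback sees the segment at finer temporal resolution.
        new_start = start + first * hop
        new_end = min(len(signal), new_start + (span - 1) * hop + win)
        start, end = new_start, new_end
        hop = max(1, hop // 2)
    return history
```

With the default arguments, the three playbacks use hop lengths of 160, 80, and 40 samples, each over a shorter segment than the one before.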