Models based on various attention mechanisms have recently achieved strong results in acoustic event classification (AEC). Among them, self-attention is often used in audio-only tasks to help the model recognize different acoustic events. Self-attention relies on the similarity between time frames and uses global information from the whole segment to highlight specific features within a frame. In real life, information related to an acoustic event attenuates over time, so the frames around the event deserve more attention than temporally distant global information that may be unrelated to it. This paper shows that self-attention may over-enhance certain segments of audio representations and smooth out the boundaries between event representations and background noise. To address this, the paper proposes event-related data conditioning (EDC) for AEC. EDC works directly on spectrograms: it adaptively selects a frame-related attention range based on acoustic features and gathers event-related local information to represent each frame. Experiments show that: 1) EDC outperforms spectrogram-based data augmentation methods, trainable feature weighting, and self-attention in both the original-size mode and the augmented mode; 2) EDC effectively gathers event-related local information and enhances the boundaries between events and backgrounds, improving AEC performance.
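The contrast between global self-attention and a frame-local attention range can be illustrated with a small sketch. This is not the EDC method itself: the abstract does not specify how EDC chooses its range, so the fixed `radius` parameter, the function names, and the toy spectrogram below are all assumptions made for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys):
    # Similarity-weighted average of `keys`, using scaled dot-product scores.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(query))]

def global_self_attention(frames):
    # Every frame attends to ALL frames in the segment.
    return [attend(q, frames) for q in frames]

def local_window_attention(frames, radius):
    # Hypothetical stand-in for a frame-related range: each frame attends
    # only to frames within a FIXED `radius` (EDC selects its range adaptively).
    out = []
    for t, q in enumerate(frames):
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        out.append(attend(q, frames[lo:hi]))
    return out

# Toy "spectrogram": 8 time frames x 2 frequency bins (assumed values).
# Frames 2-3 and frame 7 contain events; the rest are quiet background.
frames = [[0.1, 0.1], [0.1, 0.1], [3.0, 2.0], [3.0, 2.0],
          [0.1, 0.1], [0.1, 0.1], [0.1, 0.1], [3.0, 2.0]]

global_out = global_self_attention(frames)
local_out = local_window_attention(frames, radius=1)

# Frame 5 is background with only background neighbours.
print("global:", global_out[5])
print("local: ", local_out[5])
```

In this toy case the global variant pulls energy from the distant event frames into background frame 5, which blurs the event/background boundary, while the windowed variant leaves that frame close to its background level.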