In this paper, we propose AST-SED, an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, which is pretrained on the large-scale AudioSet for the audio tagging (AT) task. Pretrained AST models have recently shown promise on DCASE2022 challenge task4, where they help mitigate the lack of sufficient real annotated data. However, mainly due to differences between the AT and SED tasks, directly using the outputs of a pretrained AST model is suboptimal. The proposed AST-SED therefore adopts an encoder-decoder architecture that enables effective and efficient fine-tuning without redesigning or retraining the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self-attention along the frequency axis, to address the issue of multiple overlapping audio events in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU), to compensate for the loss of temporal resolution in the pretrained AST model output. Experimental results on the DCASE2022 task4 development set demonstrate the superiority of the proposed AST-SED with the FTE-LGD architecture. Specifically, its Event-Based F1-score (EB-F1) of 59.60% and Polyphonic Sound Detection Score scenario 1 (PSDS1) of 0.5140 significantly outperform CRNN and other pretrained AST-based systems.
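As a rough illustration of the NNI step in the LGD, the following minimal sketch upsamples a coarse sequence of frame vectors along the time axis by copying the nearest source frame. The function name `nearest_neighbor_upsample`, the toy frame values, and the target length are assumptions made for illustration and are not taken from the paper; the Bi-GRU that follows NNI in the LGD is omitted.

```python
def nearest_neighbor_upsample(frames, target_len):
    """Upsample a list of frame vectors to target_len frames along time.

    Hypothetical helper for illustration only, not the authors' code.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    src_len = len(frames)
    # Output frame i copies the source frame at index floor(i * src_len / target_len).
    return [list(frames[(i * src_len) // target_len]) for i in range(target_len)]


# Toy input: 3 coarse frames with 2 features each (assumed values).
coarse = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
fine = nearest_neighbor_upsample(coarse, 6)
```

With an integer upsampling factor, as here, each coarse frame is simply repeated. The upsampled sequence carries no new temporal detail on its own, which is presumably why the LGD passes it through a recurrent layer afterwards.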