Automated audio captioning (AAC) aims to describe audio data with natural-language captions. Most existing AAC methods adopt an encoder-decoder structure, where an attention-based mechanism is a popular choice in the decoder (e.g., the Transformer decoder) for predicting captions from audio features. Such attention-based decoders can capture global information from the audio features; however, their ability to extract local information can be limited, which may degrade the quality of the generated captions. In this paper, we present an AAC method with an attention-free decoder, where an encoder based on PANNs is employed for audio feature extraction, and the attention-free decoder is designed to introduce local information. The proposed method enables the effective use of both global and local information from audio signals. Experiments show that our method outperforms state-of-the-art methods that use a standard attention-based decoder on Task 6 of the DCASE 2021 Challenge.
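To make the global-versus-local distinction concrete, the sketch below is a heavily simplified, hypothetical illustration (not the paper's actual model): local information is gathered with a causal 1-D convolution over the already-generated token embeddings, while global information comes from mean-pooling the encoder's audio features. All function and parameter names here are illustrative assumptions.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution over a (T, d) token sequence with a (k, d)
    depthwise kernel: each output step sees only the current and past tokens,
    modelling *local* context."""
    T, d = x.shape
    k = w.shape[0]
    padded = np.vstack([np.zeros((k - 1, d)), x])  # left-pad so output is causal
    out = np.zeros_like(x)
    for t in range(T):
        window = padded[t:t + k]                   # (k, d) past-only window
        out[t] = np.einsum("kd,kd->d", window, w)  # depthwise weighted sum
    return out

def attention_free_decoder_step(tokens, audio_feats, w_local):
    """Hypothetical attention-free decoder layer: fuse local token context
    (causal convolution) with a global audio context (mean pooling)."""
    local_ctx = causal_conv1d(tokens, w_local)  # local information per step
    global_ctx = audio_feats.mean(axis=0)       # one global audio summary
    return local_ctx + global_ctx               # (T, d) fused representation
```

Note that the causal convolution guarantees that changing a future token cannot alter earlier decoder outputs, which is the property an autoregressive caption decoder requires; a real system would of course stack such layers and add projections and nonlinearities.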