Audio captioning aims to generate text descriptions of environmental sounds. One challenge of audio captioning is the difficulty of generalization due to the lack of audio-text paired training data. In this work, we propose a simple yet effective method for dealing with small-scale datasets by leveraging a pre-trained language model. We keep the language model frozen to maintain its expressivity for text generation, and we only learn to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, we employ mapping networks that translate audio features into continuous vectors the language model can understand, called prefixes. We evaluate our proposed method on the Clotho and AudioCaps datasets and show that it outperforms prior art in diverse experimental settings.
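The sketch below is a minimal illustration of the prefix-mapping idea described above: audio features are projected by a small mapping network into a sequence of prefix embeddings that are prepended to the caption token embeddings of a frozen language model. It assumes GPT-2 as the frozen language model and an MLP mapper with illustrative dimensions and prefix length; the actual audio encoder, mapping-network architecture, and hyperparameters of the proposed method are not specified here.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class AudioPrefixMapper(nn.Module):
    """Hypothetical mapping network: projects a pooled audio feature vector
    into a sequence of prefix embeddings in the language model's space."""

    def __init__(self, audio_dim=768, prefix_len=10, lm_dim=768, hidden_dim=1024):
        super().__init__()
        self.prefix_len = prefix_len
        self.lm_dim = lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, prefix_len * lm_dim),
        )

    def forward(self, audio_feats):                 # audio_feats: (B, audio_dim)
        prefix = self.mlp(audio_feats)              # (B, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)


# Frozen language model: only the mapper (and the audio encoder, not shown)
# would receive gradient updates during training.
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():
    p.requires_grad = False

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default

mapper = AudioPrefixMapper(audio_dim=768, lm_dim=lm.config.n_embd)

# Toy forward pass: prepend audio-derived prefixes to caption token embeddings.
audio_feats = torch.randn(2, 768)                    # stand-in for audio encoder output
prefix_embeds = mapper(audio_feats)                  # (2, 10, 768)
captions = tokenizer(["a dog barks", "rain falls on a roof"],
                     return_tensors="pt", padding=True)
tok_embeds = lm.transformer.wte(captions.input_ids)  # (2, T, 768)
inputs_embeds = torch.cat([prefix_embeds, tok_embeds], dim=1)
out = lm(inputs_embeds=inputs_embeds)                # logits over prefix + caption positions
print(out.logits.shape)
```

In this setup, a captioning loss would be computed only on the caption token positions, and at inference time the frozen language model would decode a caption conditioned solely on the audio-derived prefix.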