Audio captioning aims to generate text descriptions of environmental sounds. One challenge of audio captioning is the difficulty of generalization due to the lack of audio-text paired training data. In this work, we propose a simple yet effective method for dealing with small-scale datasets by leveraging a pre-trained language model. We keep the language model frozen to maintain its expressivity for text generation, and only learn to extract global and temporal features from the input audio. To bridge the modality gap between the audio features and the language model, we employ mapping networks that translate audio features into continuous vectors the language model can understand, called prefixes. We evaluate our proposed method on the Clotho and AudioCaps datasets and show that it outperforms prior art in diverse experimental settings.
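As a rough illustration of the approach described above, below is a minimal sketch (in PyTorch, not the authors' code) of prefix-based conditioning with a frozen language model: a small mapping network turns a pooled audio feature vector into a fixed number of prefix embeddings, which are prepended to the caption token embeddings of a frozen GPT-2. The class name `AudioPrefixMapper`, the prefix length, and the feature dimensions are illustrative assumptions, and GPT-2 from Hugging Face Transformers stands in for whatever language model and audio encoder the paper actually uses.

```python
# A minimal sketch of prefix-based audio captioning with a frozen language
# model -- NOT the authors' implementation. Audio features are mapped to
# "prefix" embeddings that are prepended to the caption token embeddings of
# a frozen GPT-2; only the mapping network receives gradients.
# `AudioPrefixMapper`, the dimensions, and the prefix length are assumptions.
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class AudioPrefixMapper(nn.Module):
    """Maps a pooled audio feature vector to a sequence of prefix embeddings."""

    def __init__(self, audio_dim: int = 527, prefix_len: int = 10, lm_dim: int = 768):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.mlp = nn.Sequential(
            nn.Linear(audio_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # (batch, audio_dim) -> (batch, prefix_len, lm_dim)
        return self.mlp(audio_feat).view(-1, self.prefix_len, self.lm_dim)


def caption_loss(mapper, lm, tokenizer, audio_feat, captions):
    """Cross-entropy loss on caption tokens, conditioned on audio prefixes."""
    tokens = tokenizer(captions, return_tensors="pt", padding=True)
    token_emb = lm.transformer.wte(tokens.input_ids)        # (B, T, lm_dim)
    prefix_emb = mapper(audio_feat)                          # (B, P, lm_dim)
    inputs = torch.cat([prefix_emb, token_emb], dim=1)       # (B, P+T, lm_dim)
    # Prefix and padding positions are excluded from the loss (-100 is ignored).
    prefix_labels = torch.full(prefix_emb.shape[:2], -100, dtype=torch.long)
    caption_labels = tokens.input_ids.masked_fill(tokens.attention_mask == 0, -100)
    labels = torch.cat([prefix_labels, caption_labels], dim=1)
    attn = torch.cat([torch.ones(prefix_emb.shape[:2], dtype=torch.long),
                      tokens.attention_mask], dim=1)
    return lm(inputs_embeds=inputs, attention_mask=attn, labels=labels).loss


if __name__ == "__main__":
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    lm = GPT2LMHeadModel.from_pretrained("gpt2")
    for p in lm.parameters():            # keep the language model frozen
        p.requires_grad = False

    mapper = AudioPrefixMapper()         # only this module is trained
    audio_feat = torch.randn(2, 527)     # stand-in for a pooled audio-encoder output
    loss = caption_loss(mapper, lm, tokenizer, audio_feat,
                        ["a dog barks in the distance", "rain falls on a tin roof"])
    loss.backward()                      # gradients reach only the mapping network
    print(float(loss))
```

Because the language model is frozen, the only trainable parameters sit in the mapping network, which reflects the motivation stated in the abstract: learning from small-scale audio-text datasets without degrading the language model's text generation ability.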