Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most existing research extracts acoustic features from audio clips and feeds them to encoder-decoder or transformer architectures that produce captions in a sequence-to-sequence manner. Because of data insufficiency and the limited learning capacity of these architectures, additional information beyond acoustic features is needed to generate natural language sentences. To address these problems, an encoder-decoder architecture is proposed that learns from both acoustic features and extracted audio event labels as inputs. The proposed model builds on pre-trained acoustic features and audio event detection. Extensive experiments with different acoustic features, word embedding models, audio event label extraction methods, and implementation configurations show which combinations perform better on the audio captioning task. Results on multiple datasets show that combining audio event labels with acoustic features improves captioning performance, and the proposed method either outperforms or achieves competitive results compared with state-of-the-art models.
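As a rough illustration of this dual-input idea, the sketch below fuses a pre-trained acoustic-feature encoding with embeddings of detected audio event labels before decoding the caption word by word. All module names, dimensions, and the fusion strategy are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the authors' code): an encoder-decoder
# that fuses pre-trained acoustic features with embeddings of detected
# audio event labels to condition caption generation.
import torch
import torch.nn as nn


class CaptionModel(nn.Module):
    def __init__(self, vocab_size, n_event_labels,
                 feat_dim=128, hidden_dim=256, embed_dim=256):
        super().__init__()
        # Encoder over a sequence of pre-trained acoustic features
        # (e.g. one vector per frame from a pre-trained audio network).
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Embedding for audio event labels predicted by an event detector.
        self.event_embed = nn.Embedding(n_event_labels, hidden_dim)
        # Word embedding and decoder that generate the caption token by token.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRU(embed_dim + 2 * hidden_dim, hidden_dim,
                              batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, acoustic_feats, event_labels, caption_in):
        # acoustic_feats: (B, T, feat_dim); event_labels: (B, n_events);
        # caption_in: (B, L) token ids of the caption shifted right.
        _, h = self.encoder(acoustic_feats)                      # (1, B, hidden)
        audio_ctx = h.squeeze(0)                                 # (B, hidden)
        event_ctx = self.event_embed(event_labels).mean(dim=1)   # (B, hidden)
        ctx = torch.cat([audio_ctx, event_ctx], dim=-1)          # fused context
        words = self.word_embed(caption_in)                      # (B, L, embed)
        ctx_seq = ctx.unsqueeze(1).expand(-1, words.size(1), -1)
        dec_out, _ = self.decoder(torch.cat([words, ctx_seq], dim=-1))
        return self.out(dec_out)                                 # (B, L, vocab)


# Example with dummy shapes only; real inputs would come from a pre-trained
# feature extractor and an audio event tagger (assumed components here).
model = CaptionModel(vocab_size=5000, n_event_labels=527)
logits = model(torch.randn(2, 100, 128),
               torch.randint(0, 527, (2, 3)),
               torch.randint(0, 5000, (2, 12)))
```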