Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language that describes the audio events, the sources of those events, and the relationships between them. The limited number of samples in current AAC datasets has established a trend of incorporating transfer learning with Audio Event Detection (AED) as a parent task. Towards this direction, in this paper we propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC, and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor, combined with temporal attention and augmentation techniques, is able to surpass existing approaches in the literature that rely on computationally intensive architectures. Furthermore, we provide evidence that the non-uniform attention-weighted encoding generated by our model enables the decoder to glance over specific sections of the audio while generating each token.
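To make the described pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names, dimensions (e.g., a 512-dimensional AED embedding, 128 Bi-LSTM hidden units, a 256-dimensional decoder state), and the additive attention form are all illustrative assumptions. It shows how frame-level embeddings from a pretrained AED model could feed a lightweight Bi-LSTM encoder whose outputs are then non-uniformly re-weighted by temporal attention for each decoded token.

```python
# Illustrative sketch only; shapes and attention form are assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, emb_dim=512, hidden=128):
        super().__init__()
        # "Lightweight": a single Bi-LSTM layer with a small hidden size.
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, aed_embeddings):         # (B, T, emb_dim) from a pretrained AED model
        outputs, _ = self.rnn(aed_embeddings)  # (B, T, 2 * hidden)
        return outputs

class TemporalAttention(nn.Module):
    """Non-uniform weights over encoder time steps, recomputed per decoded token."""
    def __init__(self, enc_dim=256, dec_dim=256):
        super().__init__()
        self.score = nn.Linear(enc_dim + dec_dim, 1)

    def forward(self, enc_out, dec_state):      # (B, T, enc_dim), (B, dec_dim)
        T = enc_out.size(1)
        query = dec_state.unsqueeze(1).expand(-1, T, -1)
        scores = self.score(torch.cat([enc_out, query], dim=-1)).squeeze(-1)  # (B, T)
        weights = torch.softmax(scores, dim=1)
        # Weighted sum lets the decoder "glance" at specific audio sections.
        context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)         # (B, enc_dim)
        return context, weights

# Usage with dummy AED embeddings (batch of 4 clips, 100 frames each):
enc, attn = BiLSTMEncoder(), TemporalAttention()
feats = torch.randn(4, 100, 512)
context, w = attn(enc(feats), torch.zeros(4, 256))
print(context.shape, w.shape)  # torch.Size([4, 256]) torch.Size([4, 100])
```

The per-token attention weights `w` are what the abstract refers to as the non-uniform weighting: plotting them over time would indicate which sections of the audio the decoder attends to while generating each caption token.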