In this work, we study the performance of a model trained with a sentence embedding regression loss component for the Automated Audio Captioning task. This task aims to build systems that can describe audio content with a single sentence written in natural language. Most systems are trained with the standard Cross-Entropy loss, which does not account for the semantic closeness of sentences. We found that adding a sentence embedding loss term not only reduced overfitting but also increased SPIDEr from 0.397 to 0.418 in our first setting on the AudioCaps corpus. When we increased the weight decay value, our model came much closer to current state-of-the-art methods, with a SPIDEr score of up to 0.444 compared to 0.475, while using eight times fewer trainable parameters. In this training setting, however, the sentence embedding loss no longer impacts model performance.
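The combined objective described above can be sketched as a standard token-level cross-entropy term plus a regression term between sentence embeddings. This is a minimal NumPy illustration, not the paper's implementation: the embedding model, the regression distance, and the weighting factor `lam` are all assumptions made for the example.

```python
import numpy as np

def cross_entropy(logits, target_ids):
    """Token-level cross-entropy over a caption, averaged over time steps.
    logits: (T, V) unnormalized scores; target_ids: (T,) token indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def embedding_regression(pred_emb, ref_emb):
    """MSE between the predicted-caption embedding and the reference embedding.
    (The choice of MSE is an assumption for this sketch.)"""
    return ((pred_emb - ref_emb) ** 2).mean()

def combined_loss(logits, target_ids, pred_emb, ref_emb, lam=0.5):
    """Cross-entropy plus a weighted sentence-embedding regression term.
    `lam` is a hypothetical weighting hyperparameter."""
    return cross_entropy(logits, target_ids) + lam * embedding_regression(pred_emb, ref_emb)
```

When the predicted and reference embeddings match, the regression term vanishes and the objective reduces to plain cross-entropy; otherwise the model is penalized for captions that are semantically far from the reference, even if individual tokens score well.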