Audio captioning quality metrics, which are typically borrowed from machine translation and image captioning, measure the degree of overlap between predicted tokens and gold reference tokens. In this work, we consider a metric that measures semantic similarity between predicted and reference captions rather than exact word overlap. We first evaluate its ability to capture similarities among captions corresponding to the same audio file and compare it to established metrics. We then propose a fine-tuning method that directly optimizes the metric by backpropagating through a sentence embedding extractor and the audio captioning network. This fine-tuning improves the predicted captions as measured by both traditional metrics and the proposed semantic similarity captioning metric.
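To make the idea of an embedding-based caption metric concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that similarity is the cosine between sentence embeddings and that scores are averaged over the reference captions; neither choice is stated above. The names `semantic_similarity_score`, `toy_embed`, and `TOY_VECTORS` are hypothetical, and `toy_embed` is a hand-made stand-in for a real sentence embedding extractor.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_similarity_score(
    predicted: str,
    references: Sequence[str],
    embed: Callable[[str], Vector],
) -> float:
    """Mean cosine similarity between the predicted caption and each reference.

    Assumption: averaging over references is an illustrative choice.
    """
    if not references:
        raise ValueError("at least one reference caption is required")
    pred_vec = embed(predicted)
    sims = [cosine_similarity(pred_vec, embed(ref)) for ref in references]
    return sum(sims) / len(sims)


# Hypothetical 2-d word vectors, hand-picked so that related words lie close
# together. A real system would use a trained sentence embedding model.
TOY_VECTORS = {
    "dog": (1.0, 0.0), "puppy": (0.9, 0.1),
    "barks": (0.8, 0.2), "yelps": (0.7, 0.3),
    "rain": (0.0, 1.0), "falls": (0.1, 0.9),
}


def toy_embed(caption: str) -> List[float]:
    """Average the toy vectors of the known words in the caption."""
    vecs = [TOY_VECTORS[w] for w in caption.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]


if __name__ == "__main__":
    related = semantic_similarity_score("dog barks", ["puppy yelps"], toy_embed)
    unrelated = semantic_similarity_score("dog barks", ["rain falls"], toy_embed)
    print(f"related: {related:.3f}, unrelated: {unrelated:.3f}")
```

In this toy setup, "dog barks" and "puppy yelps" share no words, so an overlap-based metric would score them at zero, while the embedding-based score is high. Using such a score as a fine-tuning objective would additionally require a differentiable embedding extractor, which this sketch does not cover.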