Automatic Audio Captioning (AAC) is the task of describing an audio signal in natural language. AAC systems take an audio signal as input and output a free-form text sentence, called a caption. Evaluating such systems is not trivial, since there are many ways to express the same idea. For this reason, several complementary metrics, such as BLEU, CIDEr, SPICE and SPIDEr, are used to compare a single automatic caption to one or several reference captions produced by human annotators. Nevertheless, an automatic system can produce several caption candidates, either by using some randomness in the sentence generation process or by considering the competing hypotheses kept during decoding, with beam search for instance. From the point of view of an end user of an AAC system, presenting several captions instead of a single one seems a relevant way to provide some diversity, as information retrieval systems do. In this work, we explore the possibility of considering several predicted captions in the evaluation process instead of one. For this purpose, we propose SPIDEr-max, a metric that takes the maximum SPIDEr value among the scores of several caption candidates. To advocate for our metric, we report experiments on Clotho v2.1 and AudioCaps with a Transformer-based system. On AudioCaps, for example, this system reached a SPIDEr-max value (with 5 candidates) close to the reference human SPIDEr score.
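The selection step behind SPIDEr-max can be illustrated with a minimal Python sketch: score every candidate caption against the references and keep the maximum. This is an illustration under stated assumptions, not the authors' implementation. The names `score_max`, `spider_from_parts` and `toy_overlap` are hypothetical; `toy_overlap` is a simple word-overlap stand-in so that the example runs without external dependencies, and it is not SPIDEr. SPIDEr itself is commonly computed as the average of CIDEr-D and SPICE, which a real scorer would have to provide.

```python
from typing import Callable, Sequence, Tuple

# A scorer maps (candidate caption, reference captions) to a float.
Scorer = Callable[[str, Sequence[str]], float]


def spider_from_parts(cider_d: float, spice: float) -> float:
    """Combine precomputed CIDEr-D and SPICE values into SPIDEr (their mean)."""
    return (cider_d + spice) / 2.0


def score_max(
    candidates: Sequence[str],
    references: Sequence[str],
    scorer: Scorer,
) -> Tuple[float, str]:
    """Return the highest score among the candidates and the caption reaching it.

    With a true SPIDEr scorer plugged in, the returned score corresponds to
    SPIDEr-max for one audio clip.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    scored = [(scorer(caption, references), caption) for caption in candidates]
    return max(scored, key=lambda pair: pair[0])


def toy_overlap(candidate: str, references: Sequence[str]) -> float:
    """Hypothetical stand-in scorer (NOT SPIDEr): best word-set Jaccard overlap."""
    cand_words = set(candidate.lower().split())
    best = 0.0
    for reference in references:
        ref_words = set(reference.lower().split())
        union = cand_words | ref_words
        if union:
            best = max(best, len(cand_words & ref_words) / len(union))
    return best


# Example: three candidates, e.g. taken from a beam search, and one reference.
candidates = ["a dog barks", "a dog barks loudly outside", "music plays"]
references = ["a dog barks loudly outside"]
best_score, best_caption = score_max(candidates, references, toy_overlap)
print(best_score, best_caption)
```

A corpus-level value would then be obtained by averaging the per-clip maxima over the evaluation set; the number of candidates per clip (5 in the AudioCaps result quoted above) is a parameter of the metric.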