Simultaneous speech translation (SimulST) systems aim to generate their output with the lowest possible latency, which is normally measured with Average Lagging (AL). In this paper we show that, despite its widespread adoption, AL underestimates latency for systems whose predictions are longer than the corresponding references. We also show that this problem has practical relevance, as recent SimulST systems do tend to over-generate. As a solution, we propose LAAL (Length-Adaptive Average Lagging), a modified version of the metric that accounts for over-generation and allows for unbiased evaluation of both under- and over-generating systems.
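To make the over-generation effect concrete, the following minimal sketch computes both metrics for a toy hypothesis. The abstract does not give the formulas, so the sketch assumes the commonly used speech formulation of AL: the delay of each emitted word is compared with an ideal delay that grows by (source duration / reference length) per word, averaged up to the first word emitted after the full source has been read. It further assumes that LAAL differs only in dividing by the larger of the hypothesis and reference lengths. The function name `average_lagging`, its parameters, and the delay values are hypothetical and chosen purely for illustration.

```python
from typing import Sequence


def average_lagging(
    delays_ms: Sequence[float],
    source_ms: float,
    ref_len: int,
    length_adaptive: bool = False,
) -> float:
    """Sketch of AL (default) or LAAL (length_adaptive=True).

    delays_ms[i] is the amount of source audio (ms) consumed when the
    i-th hypothesis word was emitted. Assumed formulation, not taken
    from the abstract.
    """
    hyp_len = len(delays_ms)
    if hyp_len == 0 or ref_len <= 0:
        raise ValueError("need a non-empty hypothesis and a positive reference length")

    # AL normalizes by the reference length only; the assumed LAAL variant
    # uses the longer of hypothesis and reference.
    target_len = max(hyp_len, ref_len) if length_adaptive else ref_len
    ideal_step = source_ms / target_len  # ideal delay increment per word

    total = 0.0
    counted = 0
    for i, delay in enumerate(delays_ms):
        total += delay - i * ideal_step
        counted += 1
        if delay >= source_ms:
            # Stop at the first word emitted after the whole source was read.
            break
    return total / counted


# Toy case: 4-second source, 4-word reference, 8-word (over-generated) hypothesis.
delays = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0, 3500.0, 3800.0, 4000.0]
al = average_lagging(delays, 4000.0, ref_len=4)
laal = average_lagging(delays, 4000.0, ref_len=4, length_adaptive=True)
print(al, laal)  # -837.5 912.5
```

In this toy case AL is negative, which would suggest the system runs ahead of an ideal translator, while the length-adaptive variant yields a positive lag. When the hypothesis is no longer than the reference, the two functions coincide under these assumptions.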