In recent years, machine speech-to-speech and speech-to-text translation has gained momentum thanks to advances in artificial intelligence, especially in the domains of speech recognition and machine translation. The quality of such applications is commonly assessed with automatic metrics, such as BLEU, primarily to measure improvements across releases or in the context of evaluation campaigns. However, little is known about how such systems compare to human performance in similar communicative tasks, or about how their performance is perceived by end users. In this paper, we present the results of an experiment aimed at evaluating the quality of a simultaneous speech translation engine by comparing it to the performance of professional interpreters. To do so, we select a framework developed for the assessment of human interpreters and use it to carry out a manual evaluation of both human and machine performance. In our sample, the human interpreters performed better in terms of intelligibility, while the machine performed slightly better in terms of informativeness. The limitations of the study and possible enhancements of the chosen framework are discussed. Despite its intrinsic limitations, the use of this framework represents a first step towards a user-centric and communication-oriented methodology for evaluating simultaneous speech translation.