In recent years, automatic speech-to-speech and speech-to-text translation has gained momentum thanks to advances in artificial intelligence, especially in the domains of speech recognition and machine translation. The quality of such applications is commonly tested with automatic metrics, such as BLEU, primarily with the goal of assessing improvements between releases or in the context of evaluation campaigns. However, little is known about how end users perceive the output of such systems or how it compares to human performance in similar communicative tasks. In this paper, we present the results of an experiment aimed at evaluating the quality of a real-time speech translation engine by comparing it to the performance of professional simultaneous interpreters. To do so, we adopt a framework developed for the assessment of human interpreters and use it to manually evaluate both the human and the machine performances. In our sample, the human interpreters performed better in terms of intelligibility, while the machine performed slightly better in terms of informativeness. We discuss the limitations of the study and possible enhancements to the chosen framework. Despite its intrinsic limitations, the use of this framework represents a first step towards a user-centric and communication-oriented methodology for evaluating real-time automatic speech translation.
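For context, the automatic metrics mentioned above, such as BLEU, score system output against reference translations at the corpus level rather than asking how the output is received in communication. The following is a minimal sketch of such a score computed with the sacreBLEU library; the example sentences are invented for illustration and are not data from this study.

```python
# Minimal sketch: corpus-level BLEU with sacreBLEU.
# The hypothesis and reference sentences below are illustrative only.
import sacrebleu

hypotheses = ["the meeting will start at nine in the morning"]
# One reference stream, parallel to the hypotheses.
references = [["the meeting starts at nine a.m."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```

Such a score reflects n-gram overlap with the references, which is precisely why it says little about how end users perceive the output, the gap this paper addresses.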