End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. This means that the generated speech has to be automatically transcribed, making the evaluation dependent on the availability and quality of automatic speech recognition (ASR) systems. In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. BLASER leverages a multilingual multimodal encoder to directly encode the speech segments for source input, translation output and reference into a shared embedding space, and computes a score of the translation quality that can be used as a proxy for human evaluation. To evaluate our approach, we construct training and evaluation sets from more than 40k human annotations covering seven language directions. The best results of BLASER are achieved by training with supervision from human rating scores. We show that, when evaluated at the sentence level, BLASER correlates significantly better with human judgment than ASR-dependent metrics, including ASR-SENTBLEU in all translation directions and ASR-COMET in five of them. Our analysis shows that combining speech and text as inputs to BLASER does not increase the correlation with human scores, and that the best correlations are achieved when using speech only, which supports the goal of our research. Moreover, we show that using ASR for references is detrimental for text-based metrics.
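For illustration, the sketch below shows one plausible way a reference-based, text-free score could be derived from such shared embeddings: the translation output is compared to both the source speech and the reference speech via cosine similarity. The function names, the embedding dimensionality, and the exact combination of similarities are assumptions made here for exposition; the supervised variant described above instead learns the scoring function from human rating scores.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def text_free_score(src_emb: np.ndarray,
                    mt_emb: np.ndarray,
                    ref_emb: np.ndarray) -> float:
    """Hypothetical unsupervised score: average the similarity of the
    translation output to the source speech and to the reference speech,
    all represented in the shared embedding space produced by the
    multilingual multimodal speech encoder."""
    return 0.5 * (cosine(src_emb, mt_emb) + cosine(ref_emb, mt_emb))

# Toy usage: random vectors stand in for speech-segment embeddings
# (in practice these would come from the shared speech encoder).
rng = np.random.default_rng(0)
src, mt, ref = (rng.normal(size=1024) for _ in range(3))
print(text_free_score(src, mt, ref))
```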