The success of multilingual automatic speech recognition (ASR) systems has empowered many voice-driven applications. However, measuring the performance of such systems remains a major challenge, owing to their dependence on manually transcribed speech data in both mono- and multilingual scenarios. In this paper, we propose a novel multilingual framework -- eWER3 -- jointly trained on acoustic and lexical representations to estimate word error rate (WER). We demonstrate the effectiveness of eWER3 in (i) predicting WER without using any internal states from the ASR system and (ii) exploiting the multilingual shared latent space to boost performance on closely related languages. We show that our proposed multilingual model outperforms the previous monolingual word error rate estimation method (eWER2) by an absolute 9\% increase in Pearson correlation coefficient (PCC), with better overall agreement between the predicted and reference WER.
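For readers unfamiliar with the two metrics above, the sketch below illustrates how reference WER (edit distance over word tokens) and the Pearson correlation coefficient are computed. This is a minimal, standard-library illustration of the evaluation quantities only, not an implementation of the eWER3 model itself; all function names here are hypothetical.

```python
import statistics

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(r)][len(h)] / len(r)

def pcc(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between predicted and reference scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# One substitution out of three reference words -> WER of 1/3
print(wer("the cat sat", "the hat sat"))
```

A WER estimator such as eWER3 is evaluated by computing PCC between its predicted WER values and the reference WER values obtained from manual transcripts, which is the metric the reported 9% absolute gain refers to.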