Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition. In this paper, we focus on the question of robust and fair evaluation metrics. To that end, we develop a reference benchmark data set of code-switching speech recognition hypotheses with human judgments. We define clear guidelines for minimal editing of automatic hypotheses and validate them using 4-way inter-annotator agreement. We then evaluate a large number of metrics in terms of their correlation with human judgments. The metrics we consider vary in representation (orthographic, phonological, semantic), directness (intrinsic vs. extrinsic), granularity (e.g., word, character), and similarity computation method. The highest correlation with human judgment is achieved using transliteration followed by text normalization. We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversational speech.
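To make the evaluation setup concrete, the following is a minimal sketch of how an edit-distance metric can be computed at word or character granularity and then correlated with human judgments. It is an illustration under stated assumptions, not the paper's implementation: the function names (`edit_distance`, `error_rate`, `pearson`) and the toy data are hypothetical, Pearson correlation is one possible choice of correlation measure, and the transliteration and text normalization steps are not shown.

```python
from math import sqrt


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp, granularity="word"):
    """Edit distance normalized by reference length.

    granularity="word" splits on whitespace (WER-style);
    any other value compares character by character (CER-style).
    """
    if granularity == "word":
        ref_units, hyp_units = ref.split(), hyp.split()
    else:
        ref_units, hyp_units = list(ref), list(hyp)
    if not ref_units:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy data: (reference, hypothesis, human error score).
# These values are invented for illustration only.
samples = [
    ("i want qahwa now", "i want qahwa now", 0.0),
    ("i want qahwa now", "i want kahwa now", 0.1),
    ("i want qahwa now", "i won coffee", 0.8),
]
metric_scores = [error_rate(r, h, "word") for r, h, _ in samples]
human_scores = [s for _, _, s in samples]
correlation = pearson(metric_scores, human_scores)
```

A metric whose scores track the human scores more closely yields a higher correlation. Swapping `"word"` for `"char"` in the call to `error_rate` gives the character-level variant of the same comparison.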