We present a general framework for computing the word error rate (WER) of ASR systems that take recordings containing multiple speakers as input and produce multiple output word sequences (MIMO). Such systems are typically required for tasks such as meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor, under the constraint that each reference utterance is matched consistently to a single hypothesis output. This also yields an efficient implementation of the ORC WER, which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER, each tuned to a particular application scenario. We conclude with a discussion of the pros and cons of the various WER definitions and recommendations on when to use each.
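To make the matching constraint concrete, the following is a minimal Python sketch of a brute-force ORC WER computation. It is not the efficient dynamic programming search described above. It assumes the usual reading of ORC WER: reference utterances are given in temporal order, each is assigned to exactly one output channel, and the error count is the sum of word-level Levenshtein distances over all channels. The function names `levenshtein` and `orc_wer_bruteforce` are hypothetical and chosen only for illustration.

```python
from itertools import product


def levenshtein(ref, hyp):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def orc_wer_bruteforce(ref_utterances, hyp_channels):
    """Illustrative brute-force ORC WER (assumed definition, see lead-in).

    Tries every assignment of reference utterances to output channels,
    so the cost grows as len(hyp_channels) ** len(ref_utterances).
    """
    refs = [u.split() for u in ref_utterances]
    hyps = [h.split() for h in hyp_channels]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference must contain at least one word")

    best = None
    for assignment in product(range(len(hyps)), repeat=len(refs)):
        errors = 0
        for c, hyp in enumerate(hyps):
            # Concatenate, in order, the utterances assigned to channel c.
            channel_ref = [w for r, a in zip(refs, assignment) if a == c for w in r]
            errors += levenshtein(channel_ref, hyp)
        if best is None or errors < best:
            best = errors
    return best / total_ref_words


refs = ["hello world", "how are you", "fine thanks"]
hyps = ["hello word fine thanks", "how are you"]
print(orc_wer_bruteforce(refs, hyps))
```

The loop over `product(...)` enumerates every possible assignment, which is the source of the exponential complexity mentioned above. The sketch is only practical for a handful of utterances.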