While end-to-end models have shown great success on the Automatic Speech Recognition task, performance degrades severely when target sentences are long-form. The previously proposed methods, overlapping inference and partial overlapping inference, have been shown to be effective for long-form decoding. For both methods, the word error rate (WER) decreases monotonically as the overlap percentage increases. Setting aside computational cost, the setup with 50% overlap during inference achieves the best performance. However, a lower overlap percentage has the advantage of faster inference. In this paper, we first conduct comprehensive experiments comparing overlapping inference and partial overlapping inference under various configurations. We then propose Voice-Activity-Detection Overlapping Inference to provide a trade-off between WER and computation cost. Results show that the proposed method achieves a 20% relative reduction in computation cost on the Librispeech and Microsoft Speech Language Translation long-form corpora while maintaining WER performance, compared to the best-performing overlapping inference algorithm. We also propose Soft-Match to compensate for the mis-alignment of similar words.
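To make the cost/overlap trade-off concrete, the following is a minimal sketch of how overlapping inference segments a long utterance; the segment length, frame counts, and function name are illustrative assumptions, not values from the paper. A higher overlap means a smaller hop between segments, so more segments must be decoded (more computation), while a lower overlap decodes fewer segments at the cost of less shared context when merging hypotheses.

```python
def overlapping_segments(total_len, seg_len, overlap):
    """Split an utterance of `total_len` frames into windows of `seg_len`
    frames that overlap by the fraction `overlap` in [0, 1).
    Returns a list of (start, end) frame indices. Illustrative sketch only;
    the paper's actual segmentation parameters may differ."""
    hop = int(seg_len * (1 - overlap))  # lower overlap -> larger hop -> fewer segments
    segments = []
    start = 0
    while start < total_len:
        segments.append((start, min(start + seg_len, total_len)))
        if start + seg_len >= total_len:
            break
        start += hop
    return segments

# Hypothetical numbers: a 3000-frame utterance with 600-frame windows.
print(len(overlapping_segments(3000, 600, 0.50)))  # 9 segments to decode
print(len(overlapping_segments(3000, 600, 0.25)))  # 7 segments to decode
```

With 50% overlap each frame is decoded roughly twice, whereas lowering the overlap reduces the number of decoded segments and hence inference time, which is the trade-off the proposed VAD-based method navigates.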