While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, which makes it hard to incorporate the right context into punctuation decisions. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows, and we measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average BLEU-score improvement of 0.66 for the downstream task of Machine Translation (MT).
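To make the dynamic-decoding-window idea concrete, the following is a minimal sketch, not the paper's implementation: `predict_punctuation` is a placeholder for a transformer sequence-tagging model, and the class name, label set, and `max_window` parameter are illustrative assumptions. The key mechanism is that tokens accumulate in a window, only text up to the last detected sentence boundary is finalized, and the unfinished tail is kept so the next pass sees it with more right context instead of being forced into a sentence at an arbitrary ASR segment break.

```python
from typing import List

# Placeholder for a transformer punctuation tagger.  A real model would
# predict per-token labels (SENTENCE_END / NONE / ...) using bidirectional
# context over the whole window; this toy stub only reacts to a spoken
# "period" cue so the sketch stays self-contained and runnable.
def predict_punctuation(tokens: List[str]) -> List[str]:
    return ["SENTENCE_END" if t == "period" else "NONE" for t in tokens]

class StreamingPunctuator:
    """Minimal sketch of streaming punctuation with a dynamic decoding
    window (hypothetical class, assumed parameters)."""

    def __init__(self, max_window: int = 128):
        self.max_window = max_window  # latency bound, e.g. for slow speakers
        self.window: List[str] = []

    def push(self, tokens: List[str]) -> List[str]:
        """Add streaming ASR tokens; return any finalized sentences."""
        self.window.extend(tokens)
        labels = predict_punctuation(self.window)
        # Index of the last detected sentence boundary in the window.
        last_end = max((i for i, lab in enumerate(labels)
                        if lab == "SENTENCE_END"), default=-1)
        if last_end >= 0:
            done = self.window[:last_end + 1]
            self.window = self.window[last_end + 1:]
            return self._format(done, labels[:last_end + 1])
        if len(self.window) > self.max_window:
            # Force a flush so latency stays bounded even if no boundary
            # is ever detected (e.g. a very slow, pause-heavy speaker).
            done, self.window = self.window, []
            return self._format(done, ["NONE"] * (len(done) - 1) + ["SENTENCE_END"])
        return []

    @staticmethod
    def _format(tokens: List[str], labels: List[str]) -> List[str]:
        sentences, current = [], []
        for tok, lab in zip(tokens, labels):
            current.append(tok)
            if lab == "SENTENCE_END":
                sentences.append(" ".join(current).capitalize() + ".")
                current = []
        return sentences
```

A brief usage example under the same assumptions: the first push returns nothing because no boundary has been seen, and the second finalizes one sentence while the tail stays in the window as context for later passes.

```python
p = StreamingPunctuator(max_window=16)
print(p.push("the meeting moved to".split()))              # []
print(p.push("friday period we start at nine".split()))    # one sentence; tail retained
```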