In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation). However, most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays. We propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context. Essentially, the model learns to dynamically size its lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection.
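To make the emit-or-wait idea concrete, the following is a minimal sketch rather than the paper's implementation: a small bidirectional GRU stands in for the streaming BERT encoder, and a fixed confidence threshold plus a lookahead cap stand in for the learned training objective that decides whether to output a tag now or wait for further context. All names here (StreamingTagger, stream_decode, emit_threshold, max_lookahead) are hypothetical illustrations.

```python
import torch
import torch.nn as nn

class StreamingTagger(nn.Module):
    """Toy stand-in for a streaming disfluency tagger (hypothetical)."""
    def __init__(self, vocab_size=1000, dim=64, num_tags=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Bidirectional GRU as a stand-in for the streaming BERT encoder.
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.tag_head = nn.Linear(2 * dim, num_tags)   # fluent vs. disfluent
        self.wait_head = nn.Linear(2 * dim, 1)         # P(wait for more context)

    def forward(self, token_ids):
        # Re-encode the whole prefix so earlier tokens see new right context.
        h, _ = self.encoder(self.embed(token_ids.unsqueeze(0)))
        h = h.squeeze(0)
        return self.tag_head(h), torch.sigmoid(self.wait_head(h)).squeeze(-1)

def stream_decode(model, token_stream, emit_threshold=0.5, max_lookahead=3):
    """Emit a tag for each token as soon as the model is confident enough,
    or once the lookahead budget is exhausted (a simplification of the
    paper's learned wait/output decision)."""
    prefix, emitted, outputs = [], 0, []
    tag_logits = None
    for tok in token_stream:
        prefix.append(tok)
        with torch.no_grad():
            tag_logits, p_wait = model(torch.tensor(prefix))
        # Try to commit predictions for all tokens not yet emitted.
        while emitted < len(prefix):
            lookahead = len(prefix) - 1 - emitted
            if p_wait[emitted] < emit_threshold or lookahead >= max_lookahead:
                outputs.append(int(tag_logits[emitted].argmax()))
                emitted += 1
            else:
                break  # wait for further right context before committing
    # End of stream: flush any tokens still pending.
    while tag_logits is not None and emitted < len(prefix):
        outputs.append(int(tag_logits[emitted].argmax()))
        emitted += 1
    return outputs

model = StreamingTagger()
print(stream_decode(model, [12, 7, 7, 42, 5]))  # one tag per input token
```

The key design point this illustrates is that each incoming token triggers a re-encoding of the prefix, and a per-token wait probability controls how much right context the model accumulates before committing a prediction, which is what lets latency and flicker trade off against accuracy.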