Automatic Speech Recognition (ASR) based on Recurrent Neural Network Transducers (RNN-T) is attracting growing interest in the speech community. We investigate data selection and preparation choices that aim to improve the robustness of RNN-T ASR to speech disfluencies, with a focus on partial words. For evaluation, we use clean data, data with disfluencies, and a separate dataset of speech affected by stuttering. We show that including a small amount of data with disfluencies in the training set improves recognition accuracy on the test sets with disfluencies and stuttering. Increasing the amount of training data with disfluencies gives additional gains without degradation on the clean data. We also show that replacing partial words with a dedicated token further improves accuracy on utterances with disfluencies and stuttering. Our best model achieves relative word error rate (WER) reductions of 22.5% and 16.4% on these two evaluation sets.
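
As a rough illustration of two ideas mentioned above, the following sketch shows how partial words in a training transcript might be mapped to a dedicated token, and how a relative WER reduction is computed. It is not the authors' pipeline. The token name `<pw>`, the convention that a partial word is marked by a trailing hyphen (e.g. "wa-"), and the function names are all assumptions made for illustration, and the WER values in the usage lines are hypothetical, not results from this work.

```python
import re

# Hypothetical token name; the abstract does not say what the dedicated token is.
PARTIAL_TOKEN = "<pw>"

# Assumption: transcripts mark a partial word with a trailing hyphen, e.g. "wa-".
# Hyphenated full words such as "well-known" do not match this pattern.
_PARTIAL_WORD = re.compile(r"^\w+-$")


def replace_partial_words(transcript: str, token: str = PARTIAL_TOKEN) -> str:
    """Replace every partial word in a whitespace-tokenized transcript with `token`."""
    words = transcript.split()
    return " ".join(token if _PARTIAL_WORD.match(w) else w for w in words)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    print(replace_partial_words("i wa- want to g- go to a well-known place"))
    # Hypothetical WER values, chosen only to show the arithmetic.
    print(relative_wer_reduction(10.0, 8.0))
```

Mapping all partial words to one token keeps the output vocabulary from having to model arbitrary word fragments, which is one plausible reading of why a dedicated token could help; the abstract itself only reports that it does.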