In this paper, we propose several methods to address practical challenges in deploying RNN Transducer (RNN-T) based speech recognition systems. These challenges are adapting a well-trained RNN-T model to a new domain without collecting audio data, and obtaining time stamps and confidence scores at the word level. We address the first challenge with a data splicing method that concatenates speech segments extracted from source-domain data. To obtain time stamps, we add a phone prediction branch to the RNN-T model that shares the encoder and is used for forced alignment. Finally, we obtain word-level confidence scores from several types of features calculated during decoding and from the confusion network. Evaluated on Microsoft production data, the data splicing adaptation method outperforms the baseline and text-to-speech-based adaptation by 58.03% and 15.25% relative word error rate reduction, respectively. The proposed time stamping method achieves an average word timing difference of less than 50 ms while maintaining the recognition accuracy of the RNN-T model. We also obtain high confidence annotation performance at limited computation cost.
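The splicing idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that word-level audio segments have already been cut from source-domain recordings and stored in a hypothetical `segment_bank` that maps each word to a list of sample sequences. The function name `splice_utterance` and the choice to pick one segment at random per word are likewise assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence


def splice_utterance(
    text: str,
    segment_bank: Dict[str, List[Sequence[float]]],
    rng: random.Random,
) -> List[float]:
    """Build synthetic audio for `text` by concatenating stored word segments.

    `segment_bank` is a hypothetical structure: word -> list of audio
    segments (each a sequence of samples) taken from source-domain data.
    """
    audio: List[float] = []
    for word in text.lower().split():
        candidates = segment_bank.get(word)
        if not candidates:
            # This sketch does not handle words missing from the source domain.
            raise KeyError(f"no source-domain segment for word: {word!r}")
        # Assumption: choose one of the available segments at random.
        audio.extend(rng.choice(candidates))
    return audio
```

Pairing each spliced waveform with its target-domain text gives (audio, transcript) pairs that could be used for adaptation without recording new audio. How segments are extracted and how unseen words are handled are left out of this sketch.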