End-to-end (E2E) automatic speech recognition (ASR) systems often have difficulty recognizing uncommon words that appear infrequently in the training data. A promising method to improve recognition accuracy on such rare words is to incorporate personalized/contextual information at inference time. In this work, we present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals. Specifically, we propose a multi-head attention-based context-biasing network, which is jointly trained with the rest of the ASR sub-networks. We explore different techniques to encode contextual data and to create the final attention context vectors. We also leverage both BLSTM and pretrained BERT based models to encode contextual data and guide the network training. Using an in-house far-field dataset, we show that CATT, using a BERT-based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4%, respectively.
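To make the context-biasing idea concrete, the following is a minimal PyTorch sketch of a multi-head cross-attention biasing layer in the spirit of the abstract: queries come from the ASR encoder states and keys/values from embeddings of contextual phrases (e.g., produced by a BLSTM or a pretrained BERT encoder). All class names, dimensions, and the residual/normalization choices are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a multi-head attention context-biasing layer.
# Assumed (not from the paper): layer/variable names, dimensions, and the
# residual + LayerNorm fusion of the attention output with encoder states.
import torch
import torch.nn as nn

class ContextBiasingLayer(nn.Module):
    def __init__(self, enc_dim: int = 512, ctx_dim: int = 768, num_heads: int = 4):
        super().__init__()
        # Project context embeddings (e.g., BERT outputs) into the encoder dimension.
        self.ctx_proj = nn.Linear(ctx_dim, enc_dim)
        self.attn = nn.MultiheadAttention(enc_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(enc_dim)

    def forward(self, enc_states: torch.Tensor, ctx_emb: torch.Tensor) -> torch.Tensor:
        """enc_states: (B, T, enc_dim) encoder outputs.
        ctx_emb: (B, K, ctx_dim), one embedding per contextual phrase."""
        ctx = self.ctx_proj(ctx_emb)                        # (B, K, enc_dim)
        # Encoder states attend over the context entries; the attention output
        # is the context vector that biases recognition toward those phrases.
        bias, _ = self.attn(query=enc_states, key=ctx, value=ctx)
        return self.norm(enc_states + bias)                 # context-aware encoder states

# Toy usage: 2 utterances, 50 frames, 10 contextual phrases each.
if __name__ == "__main__":
    layer = ContextBiasingLayer()
    enc = torch.randn(2, 50, 512)
    ctx = torch.randn(2, 10, 768)
    print(layer(enc, ctx).shape)  # torch.Size([2, 50, 512])
```

In the full model such a layer would be trained jointly with the transducer's other sub-networks rather than as a standalone module.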