Today, many state-of-the-art automatic speech recognition (ASR) systems apply all-neural models that map audio directly to word sequences, trained end-to-end with a single global optimisation criterion in a fully data-driven fashion. These models achieve high recognition accuracy for domains and words represented in the training material, but have difficulty recognising words that are rarely or never seen during training, i.e., trending words and new named entities. In this paper, we use a text-to-speech (TTS) engine to provide synthetic audio for out-of-vocabulary (OOV) words. We aim to boost the recognition accuracy of a recurrent neural network transducer (RNN-T) on OOV words by using these extra audio-text pairs, while maintaining its performance on non-OOV words. We explore different regularisation techniques; the best performance is achieved by fine-tuning the RNN-T on both the original training data and the extra synthetic data, with elastic weight consolidation (EWC) applied to the encoder. This yields a 57% relative word error rate (WER) reduction on utterances containing OOV words, without any degradation on the whole test set.
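For reference, EWC penalises deviations of parameters from their values after the original training, weighted by an estimate of their importance. The abstract does not give the exact objective, so the following is the standard formulation from Kirkpatrick et al. (2017), restricted to encoder parameters as described above; the trade-off weight λ and the diagonal Fisher terms F_i are part of that standard formulation, not quantities reported in this paper:

\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{RNN\text{-}T}}(\theta) \;+\; \frac{\lambda}{2} \sum_{i \in \mathrm{enc}} F_i \left(\theta_i - \theta_i^{*}\right)^2

where θ* are the encoder parameters of the originally trained model, F_i is the diagonal Fisher information estimated on the original training data, and λ controls the regularisation strength. Intuitively, parameters deemed important for the original (non-OOV) task are anchored near θ*, which is consistent with the reported goal of improving OOV recognition without degrading overall WER.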