Exploiting effective target modeling units is important and has long been a concern in end-to-end automatic speech recognition (ASR). In this work, we propose a phonetic-assisted multi-target units (PMU) modeling approach to enhance the Conformer-Transducer ASR system in a progressive representation learning manner. Specifically, PMU first uses pronunciation-assisted subword modeling (PASM) and byte pair encoding (BPE) to produce phonetic-induced and text-induced target units, respectively. Then, three new frameworks are investigated to enhance the acoustic encoder, namely the basic PMU, paraCTC, and pcaCTC, which integrate the PASM and BPE units at different levels for CTC and transducer multi-task training. Experiments on both LibriSpeech and accented ASR tasks show that the proposed PMU significantly outperforms conventional BPE, reducing the WER on the LibriSpeech clean and other test sets and on six accented ASR test sets by a relative 12.7%, 6.0%, and 7.7%, respectively. A minimal sketch of the basic joint objective is given below.
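The abstract describes combining a transducer loss on BPE targets with auxiliary CTC losses on both phonetic-induced (PASM) and text-induced (BPE) units. The following is a minimal sketch of such a basic multi-target multi-task objective only; the function name, projection heads, and interpolation weights (lam_pasm, lam_bpe) are illustrative assumptions, and the paraCTC/pcaCTC variants are not detailed here.

```python
import torch
import torch.nn.functional as F

def pmu_multitask_loss(
    transducer_loss: torch.Tensor,   # precomputed RNN-T loss on BPE targets
    enc_out: torch.Tensor,           # acoustic encoder output, shape (B, T, D)
    enc_lens: torch.Tensor,          # valid frame counts per utterance, shape (B,)
    pasm_head: torch.nn.Linear,      # hypothetical projection to PASM vocab (+ blank)
    bpe_head: torch.nn.Linear,       # hypothetical projection to BPE vocab (+ blank)
    pasm_targets: torch.Tensor, pasm_lens: torch.Tensor,
    bpe_targets: torch.Tensor, bpe_lens: torch.Tensor,
    lam_pasm: float = 0.3, lam_bpe: float = 0.3,  # assumed interpolation weights
) -> torch.Tensor:
    """Add auxiliary CTC losses on phonetic-induced (PASM) and text-induced
    (BPE) target units to the transducer loss (basic PMU-style sketch)."""
    def aux_ctc(head, targets, target_lens):
        # (B, T, V) -> (T, B, V) log-probs, as required by F.ctc_loss
        log_probs = F.log_softmax(head(enc_out), dim=-1).transpose(0, 1)
        return F.ctc_loss(log_probs, targets, enc_lens, target_lens,
                          blank=0, zero_infinity=True)

    return (transducer_loss
            + lam_pasm * aux_ctc(pasm_head, pasm_targets, pasm_lens)
            + lam_bpe * aux_ctc(bpe_head, bpe_targets, bpe_lens))
```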