Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems make the two highly diverse and complementary. This paper investigates multi-pass rescoring and cross adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs, which were then rescored by the speaker adapted Conformer system using a 2-way cross system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system, or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the standalone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
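The multi-pass rescoring step described above can be illustrated with a minimal sketch. It assumes each N-best hypothesis carries one log-domain score from the first-pass hybrid system and one from the second-pass Conformer system, and that the 2-way interpolation is a simple weighted sum. The names `Hypothesis` and `rescore_nbest`, the `weight` parameter and the example scores are all hypothetical; the abstract does not specify the exact interpolation form or weights.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    hybrid_score: float     # log-domain score from the first-pass hybrid system (assumed)
    conformer_score: float  # log-domain score from the second-pass Conformer (assumed)


def rescore_nbest(nbest, weight=0.5):
    """Sort hypotheses by a 2-way interpolated score, best first.

    `weight` is the share given to the Conformer score; the value
    used here is illustrative, not taken from the paper.
    """
    def combined(h):
        return (1.0 - weight) * h.hybrid_score + weight * h.conformer_score

    return sorted(nbest, key=combined, reverse=True)


# Toy N-best list with made-up scores.
nbest = [
    Hypothesis("i want to recognise speech", -12.0, -9.0),
    Hypothesis("i want to wreck a nice beach", -11.5, -14.0),
]
best = rescore_nbest(nbest, weight=0.5)[0]
print(best.text)
```

In this toy list the hybrid system alone prefers the second hypothesis, while the interpolated score selects the first, which is the kind of cross-system correction that rescoring relies on.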