Consonant and vowel reduction are often encountered in speech and may degrade the performance of automatic speech recognition (ASR). Our recently proposed masking-based learning strategy, Phone Masking Training (PMT), alleviates the impact of such phenomena in Uyghur ASR. Although PMT achieves remarkable improvements, there is still room for further gains due to the granularity mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece). To boost the performance of PMT, we propose a multi-modeling-unit training (MMUT) architecture fused with PMT (PM-MMUT). The idea of the MMUT framework is to split the encoder into two parts: one mapping acoustic feature sequences to a phoneme-level representation (AF-to-PLR) and one mapping the phoneme-level representation to a word-piece-level representation (PLR-to-WPLR). This allows AF-to-PLR to be optimized by an intermediate phoneme-based CTC loss, so that it learns the rich phoneme-level context information brought by PMT. Experimental results on Uyghur ASR show that the proposed approaches clearly outperform pure PMT. We also conduct experiments on the 960-hour Librispeech benchmark using ESPnet1, achieving about 10% relative WER reduction on all test sets without LM fusion compared with the latest official ESPnet1 pre-trained model.
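To make the encoder split concrete, the following is a minimal sketch (not the authors' implementation) of how an AF-to-PLR / PLR-to-WPLR split with an intermediate phoneme-level CTC loss could be composed. Layer counts, vocabulary sizes, the absence of subsampling, and the interpolation weight `lambda_phone` are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of the PM-MMUT loss composition: the encoder is split
# into AF-to-PLR and PLR-to-WPLR sub-encoders; an intermediate phoneme-level
# CTC loss is applied after AF-to-PLR and a word-piece-level CTC loss after
# PLR-to-WPLR. All hyper-parameters below are assumptions for illustration.
import torch
import torch.nn as nn


class PMMMUTEncoder(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_phones=50, n_wordpieces=5000,
                 af_to_plr_layers=6, plr_to_wplr_layers=6, lambda_phone=0.3):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # AF-to-PLR: acoustic feature sequences -> phoneme-level representation
        self.af_to_plr = nn.TransformerEncoder(layer, num_layers=af_to_plr_layers)
        # PLR-to-WPLR: phoneme-level -> word-piece-level representation
        self.plr_to_wplr = nn.TransformerEncoder(layer, num_layers=plr_to_wplr_layers)
        self.phone_head = nn.Linear(d_model, n_phones)   # intermediate CTC head
        self.wp_head = nn.Linear(d_model, n_wordpieces)  # final CTC head
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)
        self.lambda_phone = lambda_phone

    def forward(self, feats, feat_lens, phone_tgts, phone_lens, wp_tgts, wp_lens):
        x = self.input_proj(feats)        # (B, T, d_model)
        plr = self.af_to_plr(x)           # phoneme-level representation
        wplr = self.plr_to_wplr(plr)      # word-piece-level representation

        # CTC expects (T, B, V) log-probabilities.
        phone_logp = self.phone_head(plr).log_softmax(-1).transpose(0, 1)
        wp_logp = self.wp_head(wplr).log_softmax(-1).transpose(0, 1)

        loss_phone = self.ctc(phone_logp, phone_tgts, feat_lens, phone_lens)
        loss_wp = self.ctc(wp_logp, wp_tgts, feat_lens, wp_lens)
        # Interpolate the intermediate phoneme CTC loss with the word-piece loss.
        return self.lambda_phone * loss_phone + (1.0 - self.lambda_phone) * loss_wp
```

In this sketch the intermediate loss only constrains the lower sub-encoder (AF-to-PLR), which is the part intended to benefit from the phoneme-level context learned through PMT; the upper sub-encoder remains supervised by the word-piece targets.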