End-to-end automatic speech recognition (ASR) has achieved promising results. However, most existing end-to-end ASR methods neglect specific language characteristics. For Mandarin Chinese ASR, pinyin and characters, as the spelling and writing systems respectively, are mutually reinforcing. Based on this intuition, we investigate the types of related models that are suitable, though not originally designed, for joint pinyin-character ASR, and propose a novel Mandarin Chinese ASR model with a dual-decoder Transformer tailored to the characteristics of pinyin transcripts and character transcripts. Specifically, a joint pinyin-character layer-wise linear interactive (LWLI) module and a phonetic posteriorgram adapter (PPGA) are proposed to achieve inter-layer multi-level interaction by adaptively fusing pinyin and character information. Furthermore, a two-stage training strategy is proposed to stabilize training and speed up convergence. Results on the AISHELL-1 dataset show that the proposed Speech-Pinyin-Character-Interaction (SPCI) model, without a language model, achieves a 9.85% character error rate (CER) on the test set, a 17.71% relative reduction compared to Transformer-based baseline models.
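To make the adaptive fusion idea concrete, the following is a minimal, illustrative sketch of a layer-wise linear interactive fusion step, assuming the pinyin and character decoder streams share the same hidden dimension at each layer; the class name, the gating formulation, and all tensor shapes are assumptions for illustration, not the paper's exact definition of the LWLI module or the PPGA.

```python
# Hypothetical sketch of a layer-wise linear interactive (LWLI) style fusion,
# assuming both decoder streams have hidden states of the same dimension.
# Names and the gating formulation are illustrative, not the paper's exact design.
import torch
import torch.nn as nn


class LWLIFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Learned gate deciding, per position, how much pinyin information
        # to mix into the character stream.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, char_hidden: torch.Tensor, pinyin_hidden: torch.Tensor) -> torch.Tensor:
        # char_hidden, pinyin_hidden: (batch, seq_len, d_model)
        g = torch.sigmoid(self.gate(torch.cat([char_hidden, pinyin_hidden], dim=-1)))
        # Adaptive linear interpolation between the two decoder streams.
        return g * char_hidden + (1.0 - g) * pinyin_hidden


if __name__ == "__main__":
    fusion = LWLIFusion(d_model=256)
    char_h = torch.randn(2, 10, 256)    # character decoder hidden states
    pinyin_h = torch.randn(2, 10, 256)  # pinyin decoder hidden states
    print(fusion(char_h, pinyin_h).shape)  # torch.Size([2, 10, 256])
```

In this sketch the sigmoid gate plays the role of the adaptive weighting between pinyin and character information; applying such a fusion at every decoder layer would give the inter-layer, multi-level interaction described above.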