The choice of modeling units affects acoustic modeling performance and plays an important role in automatic speech recognition (ASR). In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation, so considering only the written form of Chinese characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method involving multi-level modeling units, which integrates multi-level information for Mandarin speech recognition. Specifically, the encoder block uses syllables as modeling units, while the decoder block uses Chinese characters. During inference, input feature sequences are converted into syllable sequences by the encoder block and then into Chinese characters by the decoder block. This process is carried out by a unified end-to-end model without introducing additional conversion models. By introducing the InterCE auxiliary task, our method achieves competitive results on the widely used AISHELL-1 benchmark without a language model: a CER of 4.1%/4.6% with the Conformer backbone and 4.6%/5.2% with the Transformer backbone.
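To make the two-stage inference flow concrete, here is a minimal toy sketch in Python. The encoder and decoder are stand-in dictionary lookups rather than neural networks, and the syllable labels and phrase table are illustrative assumptions, not the paper's actual model or data:

```python
# Toy sketch of the multi-level inference pipeline: encoder maps acoustic
# features to syllables, decoder maps syllables to Chinese characters,
# all inside one "model" with no external conversion component.

def encode_to_syllables(feature_frames):
    # In the real method this is a Conformer/Transformer encoder trained
    # with syllable targets; here each frame is assumed to already carry
    # its recognized syllable (an illustrative simplification).
    return [frame["syllable"] for frame in feature_frames]

# Assumed toy mapping: the decoder resolves homophonous syllables into
# characters from context; a real decoder attends over encoder states.
PHRASE_TABLE = {("yu3", "yin1"): "语音"}

def decode_to_characters(syllables):
    return PHRASE_TABLE.get(tuple(syllables), "")

def recognize(feature_frames):
    # Unified end-to-end pass: syllable-to-character conversion happens
    # inside the same pipeline, not in a separate model.
    return decode_to_characters(encode_to_syllables(feature_frames))

frames = [{"syllable": "yu3"}, {"syllable": "yin1"}]
print(recognize(frames))  # prints 语音 ("speech")
```

The sketch only illustrates the division of labor between the two blocks; in the actual system both stages are learned jointly and trained with the InterCE auxiliary task.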