The choice of modeling units is crucial for automatic speech recognition (ASR). In Mandarin, Chinese characters convey meaning but do not directly encode pronunciation, so using only characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method that uses multi-level modeling units to integrate multi-level information for Mandarin speech recognition. Specifically, the encoder block uses syllables as modeling units, and the decoder block uses character-level modeling units. To facilitate the incremental conversion from syllable features to character features, we design an auxiliary task that applies cross-entropy (CE) loss to intermediate decoder layers. During inference, the input feature sequences are converted into syllable sequences by the encoder block and then into Chinese characters by the decoder block. Experiments on the widely used AISHELL-1 corpus demonstrate that our method achieves promising results, with CERs of 4.1%/4.6% and 4.6%/5.2% using the Conformer and Transformer backbones, respectively.
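The auxiliary task can be illustrated with a minimal sketch, under stated assumptions: the character-level CE at the final decoder layer is combined with a weighted mean of the same CE computed at selected intermediate layers. The function names, the choice of layers (`aux_layers`), and the weight (`aux_weight`) are hypothetical and are not taken from the paper, which may combine the terms differently.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def sequence_ce(logits_seq, targets):
    """Mean per-token CE over one character sequence."""
    total = sum(cross_entropy(l, t) for l, t in zip(logits_seq, targets))
    return total / len(targets)

def decoder_loss(layer_logits, targets, aux_layers, aux_weight=0.3):
    """Final-layer CE plus a weighted mean CE over intermediate layers.

    layer_logits[k] holds the per-token character logits of decoder layer k.
    aux_layers and aux_weight are illustrative assumptions, not paper values.
    """
    final = sequence_ce(layer_logits[-1], targets)
    if not aux_layers:
        return final
    aux = sum(sequence_ce(layer_logits[k], targets) for k in aux_layers)
    return final + aux_weight * aux / len(aux_layers)

# Toy usage: 3 decoder layers, 2 target characters, vocabulary of 4.
toy_logits = [
    [[0.1, 0.2, 0.0, 0.0], [0.0, 0.1, 0.3, 0.0]],  # layer 0 (intermediate)
    [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.5, 0.0]],  # layer 1 (intermediate)
    [[0.0, 3.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]],  # layer 2 (final)
]
loss = decoder_loss(toy_logits, targets=[1, 2], aux_layers=[0, 1])
print(f"combined decoder loss: {loss:.4f}")
```

Supervising intermediate layers with the character targets is one plausible way to encourage the gradual syllable-to-character conversion described above. The syllable-level objective on the encoder side is omitted here because the abstract does not specify it.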