In a hybrid automatic speech recognition (ASR) system, a pronunciation lexicon (PL) and a language model (LM) are essential for correctly retrieving spoken word sequences. Since Malayalam is a morphologically complex language, its vocabulary is so large that it is impossible to build a PL and an LM that cover all of its diverse word forms. Using subword tokens to build the PL and LM, and combining them into words after decoding, enables the recovery of many out-of-vocabulary words. In this work we investigate the impact of using syllables as subword tokens in place of words in Malayalam ASR, and evaluate the relative improvement in lexicon size, model memory requirement and word error rate.
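As a minimal sketch of the "combining them to form words after decoding" step, the snippet below assumes a marker convention that is not specified in the abstract: every non-final syllable of a word carries a trailing "+" continuation marker, and tokens without the marker close the current word. The function name join_subwords and the romanized example syllables are hypothetical, for illustration only.

```python
def join_subwords(tokens):
    """Merge decoded subword tokens back into words.

    Assumes non-final syllables end with a "+" continuation marker;
    a token without the marker ends the current word.
    """
    words, current = [], []
    for token in tokens:
        if token.endswith("+"):
            current.append(token[:-1])      # strip marker, keep building the word
        else:
            current.append(token)
            words.append("".join(current))  # word boundary reached
            current = []
    if current:                             # flush a trailing unfinished word, if any
        words.append("".join(current))
    return words

# Hypothetical decoder output with romanized placeholder syllables:
decoded = ["ma+", "la+", "ya+", "la+", "m", "bha+", "sha"]
print(join_subwords(decoded))  # -> ['malayalam', 'bhasha']
```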