As a key component of automatic speech recognition (ASR) and the front end of text-to-speech (TTS), grapheme-to-phoneme (G2P) conversion maps letters to their corresponding pronunciations. Existing methods are either slow or weak in accuracy, which limits their application scenarios, especially for on-device inference. In this paper, we combine the advantages of expert knowledge and a connectionist temporal classification (CTC) based neural network, and propose a novel method named LiteG2P that is fast, lightweight, and theoretically parallel. With its carefully designed architecture, LiteG2P can be deployed both on the cloud and on device. Experimental results on the CMU dataset show that the proposed method outperforms the state-of-the-art CTC-based method with 10 times fewer parameters, and is even comparable to the state-of-the-art Transformer-based sequence-to-sequence model with fewer parameters and 33 times less computation.
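To make the task concrete, the following minimal sketch illustrates only the input/output format of G2P conversion with a hypothetical CMUdict-style (ARPAbet) lexicon lookup; it is not the LiteG2P method or any model from the paper, and all names and entries below are assumptions for illustration.

```python
# Illustrative sketch of the G2P task itself (not the LiteG2P model):
# map a written word (graphemes) to its pronunciation (phonemes).
# The tiny lexicon below uses CMUdict-style ARPAbet symbols and is
# hypothetical example data, not taken from the paper.

TOY_LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "phoneme": ["F", "OW1", "N", "IY0", "M"],
}

def g2p_lookup(word: str) -> list[str]:
    """Return the phoneme sequence for a word, or raise for unseen words.

    A real G2P system (rule-based or neural, e.g. CTC-based as in the
    paper) must generalize to words that appear in no lexicon.
    """
    try:
        return TOY_LEXICON[word.lower()]
    except KeyError:
        raise KeyError(f"'{word}' is out of vocabulary; a learned G2P model is needed")

if __name__ == "__main__":
    print(g2p_lookup("speech"))  # ['S', 'P', 'IY1', 'CH']
```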