End-to-end speech synthesis models directly convert the input characters into an audio representation (e.g., spectrograms). Despite their impressive performance, such models have difficulty disambiguating the pronunciations of identically spelled words. To mitigate this issue, a separate Grapheme-to-Phoneme (G2P) model can be employed to convert the characters into phonemes before synthesizing the audio. This paper proposes SoundChoice, a novel G2P architecture that processes entire sentences rather than operating at the word level. The proposed architecture takes advantage of a weighted homograph loss that improves disambiguation, exploits curriculum learning that gradually switches from word-level to sentence-level G2P, and integrates word embeddings from BERT for further performance improvement. Moreover, the model inherits best practices from speech recognition, including multi-task learning with Connectionist Temporal Classification (CTC) and beam search with an embedded language model. As a result, SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.

Index Terms: grapheme-to-phoneme, speech synthesis, text-to-speech, phonetics, pronunciation, disambiguation.
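To make the training recipe named in the abstract concrete, below is a minimal sketch of a multi-task objective of the kind described: a CTC term on the encoder outputs, an attention-based sequence loss on the decoder outputs, and an additional weighted term restricted to homograph positions. The function name, loss weights, and tensor shapes here are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn.functional as F

def soundchoice_style_loss(
    ctc_log_probs,        # (T_enc, B, n_phonemes): log-probs from the CTC head
    dec_log_probs,        # (B, T_dec, n_phonemes): log-probs from the decoder
    targets,              # (B, T_dec): padded phoneme targets
    input_lens,           # (B,): encoder output lengths
    target_lens,          # (B,): target lengths
    homograph_mask,       # (B, T_dec): True at phonemes of homograph words
    ctc_weight=0.5,       # assumed CTC/attention interpolation weight
    homograph_weight=2.0, # assumed extra weight on homograph positions
    pad_id=0,
):
    # Multi-task CTC term computed on the encoder outputs.
    ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                     zero_infinity=True)
    # Attention (seq2seq) negative log-likelihood over all phonemes.
    nll = F.nll_loss(dec_log_probs.transpose(1, 2), targets,
                     ignore_index=pad_id)
    # Weighted homograph loss: re-penalize only the positions belonging
    # to homograph words, encouraging correct disambiguation.
    per_token = F.nll_loss(dec_log_probs.transpose(1, 2), targets,
                           ignore_index=pad_id, reduction="none")
    mask = homograph_mask.float()
    homograph = (per_token * mask).sum() / mask.sum().clamp(min=1.0)
    return (ctc_weight * ctc
            + (1.0 - ctc_weight) * nll
            + homograph_weight * homograph)
```

The sketch covers only the training objective; the beam search with an embedded language model mentioned in the abstract would apply at decoding time.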