End-to-end models, particularly Tacotron-based ones, are currently a popular solution for text-to-speech synthesis. They produce high-quality synthesized speech with little to no text preprocessing; indeed, they can be trained directly on either graphemes or phonemes as input. However, when graphemes are used as input, little is known about the relation between the underlying representations learned by the model and word pronunciations. This work investigates this relation for a Tacotron model trained on French graphemes. Our analysis shows that grapheme embeddings are related to phoneme information despite no such information being present during training. Thanks to this property, we show that grapheme embeddings learned by Tacotron models can be useful for tasks such as grapheme-to-phoneme conversion and pronunciation control in synthetic speech.
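As a minimal sketch (not the authors' code) of how such an analysis could be approached: one can extract the encoder's character-embedding matrix from a trained Tacotron checkpoint and inspect nearest neighbours in cosine space to see whether graphemes with similar pronunciations (e.g., accented vowels) cluster together. The symbol inventory, embedding dimension, and the parameter name `encoder.embedding.weight` below are assumptions and depend on the particular implementation.

```python
import numpy as np


def nearest_graphemes(embeddings: np.ndarray, symbols: list[str],
                      query: str, k: int = 5) -> list[str]:
    """Return the k graphemes whose embeddings are closest (cosine) to `query`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = unit[symbols.index(query)]
    scores = unit @ q
    order = np.argsort(-scores)
    # Skip rank 0 (the query itself) and return the next k graphemes.
    return [symbols[i] for i in order[1:k + 1]]


if __name__ == "__main__":
    # Stand-in for the learned matrix; in practice one would load it from the
    # trained model, e.g. torch.load(ckpt)["state_dict"]["encoder.embedding.weight"]
    # (a hypothetical key -- the exact name varies between Tacotron implementations).
    symbols = list("abcdefghijklmnopqrstuvwxyzéèêàâùûçœ")
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(len(symbols), 512))  # (n_symbols, emb_dim)
    print(nearest_graphemes(embeddings, symbols, "é"))
```

With real learned embeddings in place of the random stand-in, the neighbour lists give a quick qualitative check of whether phonetically related graphemes end up close in embedding space.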