In this paper, we present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process that discretizes phoneme-level F0 and duration features from a multispeaker speech dataset. The resulting sequence of prosodic labels is fed to a prosody encoder module that augments an autoregressive attention-based text-to-speech model. To improve the range and coverage of prosodic control, we apply several techniques: augmentation, F0 normalization, balanced clustering for duration, and speaker-independent clustering. The final model enables fine-grained phoneme-level prosody control for all speakers in the training set while preserving speaker identity. Instead of relying on reference utterances at inference time, we introduce a prior prosody encoder that learns the style of each speaker and enables speech synthesis without reference audio. As a realistic application scenario, we also fine-tune the multispeaker model on unseen speakers with limited amounts of data and show that the prosody control capabilities are maintained, which verifies that the speaker-independent prosodic clustering is effective. Experimental results show that the model produces high-quality speech and that the proposed method allows efficient prosody control within each speaker's range, despite the variability that a multispeaker setting introduces.
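The discretization step described above can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm or the exact normalization, so everything below is an assumption made for illustration: per-speaker z-scoring stands in for F0 normalization, a plain 1-D k-means stands in for the F0 clustering, and equal-count quantile binning stands in for balanced duration clustering. The function names (`normalize_f0_per_speaker`, `kmeans_1d`, `balanced_labels`) are hypothetical and do not come from the paper.

```python
import numpy as np


def normalize_f0_per_speaker(f0, speaker_ids):
    """Z-score each phoneme-level F0 value with its own speaker's mean and std.

    Assumption: z-scoring is one plausible form of F0 normalization; the
    abstract does not specify the scheme actually used.
    """
    f0 = np.asarray(f0, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(f0)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        std = f0[mask].std()
        out[mask] = (f0[mask] - f0[mask].mean()) / (std if std > 0 else 1.0)
    return out


def kmeans_1d(values, n_clusters, n_iter=50):
    """Plain 1-D k-means (assumed algorithm); label 0 is the lowest cluster."""
    values = np.asarray(values, dtype=float)
    # Start centroids at evenly spaced interior quantiles of the data.
    qs = np.linspace(0.0, 1.0, n_clusters + 2)[1:-1]
    centroids = np.array(np.quantile(values, qs), dtype=float)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = values[labels == k].mean()
    # Sort so that label order matches value order (low to high).
    centroids = np.sort(centroids)
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return labels, centroids


def balanced_labels(values, n_clusters):
    """Equal-count binning: each label covers roughly the same number of phonemes."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_clusters + 1)[1:-1])
    return np.searchsorted(edges, values, side="right")


# Toy data: two speakers with different F0 ranges (values in Hz are made up).
f0_hz = [100.0, 110.0, 120.0, 200.0, 220.0, 240.0]
speakers = [0, 0, 0, 1, 1, 1]
f0_norm = normalize_f0_per_speaker(f0_hz, speakers)
f0_labels, f0_centroids = kmeans_1d(f0_norm, n_clusters=3)

# Toy phoneme durations (arbitrary units), binned into three balanced classes.
durations = np.arange(1, 13)
duration_labels = balanced_labels(durations, n_clusters=3)
```

Because the F0 values are normalized per speaker before the pooled clustering, the same label refers to the same relative position within each speaker's range, which is the intuition behind speaker-independent clustering. At synthesis time, a sequence of such labels would serve as the control input to the prosody encoder.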