Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according to the user's demands. There are currently two common ways to control speaking style: (1) Pre-defining a set of speaking styles and using a categorical index to denote each one. This limits the diversity of expressiveness, because such models can only generate the pre-defined styles. (2) Using reference speech as the style input, which has the drawback that the extracted style information is neither intuitive nor interpretable. In this study, we instead use natural language as a style prompt to control the style of the synthesized speech, \textit{e.g.}, ``Sigh tone in full of sad mood with some helpless feeling''. Because no existing TTS corpus is suitable for benchmarking this novel task, we first construct a speech corpus whose samples are annotated with both content transcriptions and natural-language style descriptions. We then propose an expressive TTS model, named InstructTTS, which is novel in the following respects: (1) We take full advantage of self-supervised learning and cross-modal metric learning, and propose a novel three-stage training procedure to obtain a robust sentence embedding model that can effectively capture semantic information from the style prompts and control the speaking style of the generated speech. (2) We propose to model acoustic features in a discrete latent space and train a novel discrete diffusion probabilistic model to generate vector-quantized (VQ) acoustic tokens rather than the commonly used mel spectrogram. (3) We jointly apply mutual information (MI) estimation and minimization during acoustic model training to minimize style-speaker and style-content MI, avoiding possible leakage of content and speaker information from the style prompt.
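To make the idea of diffusion over discrete acoustic tokens more concrete, the following is a minimal sketch of a forward (corruption) process on a sequence of VQ token ids. The abstract does not specify the corruption scheme, so the mask-and-replace rule used here is an assumption, and the names `MASK`, `corrupt_tokens`, `corrupt_for_steps`, and the probability schedule are hypothetical. A model of the kind described would be trained to reverse such a chain; that part is not shown.

```python
import random

# Hypothetical id for an extra [MASK] token; not taken from the abstract.
MASK = -1


def corrupt_tokens(tokens, codebook_size, p_mask, p_replace, rng):
    """Apply one forward corruption step to a sequence of VQ token ids.

    Assumed scheme: each token is independently masked with probability
    p_mask, replaced by a uniformly drawn codebook id with probability
    p_replace, and kept otherwise. Already-masked tokens stay masked.
    """
    if p_mask < 0 or p_replace < 0 or p_mask + p_replace > 1:
        raise ValueError("p_mask and p_replace must be >= 0 and sum to at most 1")
    out = []
    for tok in tokens:
        if tok == MASK:
            out.append(MASK)
            continue
        u = rng.random()  # uniform in [0, 1)
        if u < p_mask:
            out.append(MASK)
        elif u < p_mask + p_replace:
            out.append(rng.randrange(codebook_size))
        else:
            out.append(tok)
    return out


def corrupt_for_steps(tokens, codebook_size, schedule, rng):
    """Run the forward chain; schedule is a list of (p_mask, p_replace) pairs."""
    x = list(tokens)
    for p_mask, p_replace in schedule:
        x = corrupt_tokens(x, codebook_size, p_mask, p_replace, rng)
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [3, 1, 4, 1, 5, 2, 6, 0]
    # Illustrative schedule only: corruption grows until everything is masked.
    schedule = [(0.2, 0.05), (0.4, 0.05), (1.0, 0.0)]
    print(corrupt_for_steps(clean, codebook_size=8, schedule=schedule, rng=rng))
```

Because the last step in the example schedule masks with probability 1, the chain ends in an all-`MASK` sequence, which is the usual starting point for sampling in a mask-based discrete diffusion setup.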