This paper describes a novel neural network-based speech generation model for learning prosodic representations. The representation learning problem is formulated according to the information bottleneck (IB) principle. A modified VQ-VAE quantization layer is incorporated into the speech generation model to control the IB capacity and to adjust the balance between the reconstruction power and the disentanglement capability of the learned representation. The proposed model is able to learn word-level prosodic representations from speech data. With an optimized IB capacity, the learned representations are not only adequate for reconstructing the original speech but can also be used to transfer the prosody onto different textual content. Extensive results from objective and subjective evaluations are presented to demonstrate the effect of IB capacity control, as well as the effectiveness and potential usage of the learned prosodic representations in controllable neural speech generation.
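To make the link between a quantization layer and IB capacity concrete, the following is a minimal sketch of a standard VQ-VAE-style nearest-neighbour lookup. It is not the modified layer proposed here, whose details are not given in this abstract. The function names, codebook size, and vector dimension are all hypothetical, and training components such as the straight-through gradient estimator and commitment loss are omitted. The point it illustrates is that a codebook with K entries can carry at most log2(K) bits per quantized vector, so the codebook size is one way to bound the bottleneck.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (Euclidean distance)."""
    # Squared distances between every input and every code: shape (N, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)          # index of the closest code per input
    return codebook[idx], idx

def capacity_bits(num_codes):
    """Upper bound on information per quantized vector, in bits."""
    return float(np.log2(num_codes))

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K = 8 codes of dimension 4
z = rng.normal(size=(5, 4))         # 5 word-level encoder outputs (assumed)
z_q, idx = quantize(z, codebook)
```

Under these assumptions, shrinking the codebook tightens the bottleneck (less information passes through, which may favour disentanglement), while enlarging it loosens the bottleneck (which may favour reconstruction).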