The quest for comprehensive generative models of intonation that link linguistic and paralinguistic functions to prosodic forms has been a longstanding challenge of speech communication research. More traditional intonation models have given way to the overwhelming performance of artificial intelligence (AI) techniques for training model-free, end-to-end mappings using millions of tunable parameters. The shift towards machine learning models has nonetheless posed the reverse problem: a compelling need to discover knowledge, to explain, visualise and interpret. Our work bridges a comprehensive generative model of intonation and state-of-the-art AI techniques. We build upon the modelling paradigm of the Superposition of Functional Contours model and propose a Variational Prosody Model (VPM) that uses a network of deep variational contour generators to capture the context-sensitive variation of the constituent elementary prosodic clichés. We show that the VPM can give insight into the intrinsic variability of these prosodic prototypes by learning a meaningfully structured prosodic latent space. We also show that the VPM brings improved modelling performance, especially when such variability is prominent. In a speech synthesis scenario, we believe the model can be used to generate dynamic and natural prosody contours largely devoid of averaging effects.