Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming, and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we experimentally show that $f_0$ and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space, and we develop a weakly supervised method to accurately and independently control these speech factors of variation within the learned latent subspaces. Without requiring additional information such as text or human-labeled data, this yields a deep generative model of speech spectrograms conditioned on $f_0$ and the formant frequencies, which we apply to the transformation of speech signals.