Learning and controlling the source-filter representation of speech with a variational autoencoder

Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, we show that these subspaces are orthogonal, and based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$.
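
The idea of factor-specific latent subspaces can be illustrated with a small NumPy sketch. It is only a toy under stated assumptions, not the method of this work: the "latent vectors" are random stand-ins rather than VAE encodings of speech, each subspace is estimated with plain PCA (the abstract does not specify the estimator), and every name (`pca_basis`, `subspace_overlap`, `move_in_subspace`, `z_f0`, `z_formant`) is hypothetical. The sketch estimates two subspaces, measures how close they are to orthogonal, and then changes the coordinates of a latent vector inside one subspace while leaving the rest untouched.

```python
import numpy as np


def pca_basis(z, n_components):
    """Orthonormal basis (dim x n_components) of the top principal directions of z (n x dim)."""
    centered = z - z.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components].T


def subspace_overlap(u_a, u_b):
    """Largest cosine of the principal angles between two subspaces (0 means orthogonal)."""
    return np.linalg.svd(u_a.T @ u_b, compute_uv=False).max()


def move_in_subspace(z, u, mean, target_coords):
    """Set the coordinates of z inside span(u), measured around `mean`, to target_coords."""
    inside = u @ (u.T @ (z - mean))  # current component of z within the subspace
    return z - inside + u @ target_coords


# Toy data (assumption): two factors vary along orthogonal 2-D directions of a 16-D space.
rng = np.random.default_rng(0)
dim, n = 16, 200
q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthonormal basis
base = rng.standard_normal(dim)

# "Labeled" trajectories: only one factor varies in each set, plus small noise.
z_f0 = base + rng.standard_normal((n, 2)) @ q[:, :2].T + 0.01 * rng.standard_normal((n, dim))
z_formant = base + rng.standard_normal((n, 2)) @ q[:, 2:4].T + 0.01 * rng.standard_normal((n, dim))

u_f0, mean_f0 = pca_basis(z_f0, 2), z_f0.mean(axis=0)
u_formant = pca_basis(z_formant, 2)

# Close to 0 when the two estimated subspaces are nearly orthogonal.
overlap = subspace_overlap(u_f0, u_formant)

# Move one latent vector to chosen coordinates inside the first subspace only.
target = np.array([1.0, -0.5])
z_new = move_in_subspace(z_formant[0], u_f0, mean_f0, target)

# How much the edit leaked into the other subspace (small when overlap is small).
leak = np.linalg.norm(u_formant.T @ (z_new - z_formant[0]))
```

In this toy setting, near-orthogonality is what makes the edit independent: the change applied inside one subspace has almost no component along the other, so `leak` stays small whenever `overlap` is small.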
