Conventional vocoders are commonly used as analysis tools that provide interpretable features for downstream tasks such as speech synthesis and voice conversion. They are built on signal-processing assumptions about the input signal, and therefore do not easily generalize across audio types, for example from speech to singing. In this paper, we propose a deep neural analyzer, denoted DeepA: a neural vocoder that extracts F0 and timbre/aperiodicity encodings from input speech that emulate those defined in conventional vocoders. The resulting parameters are therefore more interpretable than other latent neural representations. At the same time, because the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and to generalize from speech to singing. The proposed neural analyzer is built on a variational autoencoder (VAE) architecture. We show that DeepA improves F0 estimation over a conventional vocoder (WORLD). To the best of our knowledge, this is the first study dedicated to developing a neural framework for extracting learnable vocoder-like parameters.
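The analysis-synthesis interface described above can be sketched as follows. This is a minimal illustrative sketch of a VAE-style analyzer that maps spectral frames to an F0 track plus a latent timbre/aperiodicity encoding and decodes them back; all names, dimensions, and weights are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

# Minimal sketch of a VAE-style neural analyzer (hypothetical; names
# and dimensions are illustrative, not the paper's implementation).
rng = np.random.default_rng(0)

FRAME_DIM = 80   # e.g. spectral frame size (assumed)
LATENT_DIM = 16  # timbre/aperiodicity encoding size (assumed)

# Randomly initialized weights stand in for trained parameters.
W_enc = rng.standard_normal((FRAME_DIM, 2 * LATENT_DIM)) * 0.01
W_f0 = rng.standard_normal((FRAME_DIM, 1)) * 0.01
W_dec = rng.standard_normal((LATENT_DIM + 1, FRAME_DIM)) * 0.01

def analyze(frames):
    """Map spectral frames to (F0, latent encoding), like a vocoder analyzer."""
    f0 = np.exp(frames @ W_f0)  # positive pitch values (illustrative)
    stats = frames @ W_enc
    mu, logvar = stats[:, :LATENT_DIM], stats[:, LATENT_DIM:]
    # VAE reparameterization: sample z = mu + sigma * eps
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    return f0, z

def synthesize(f0, z):
    """Decode F0 plus encoding back into spectral frames."""
    return np.concatenate([f0, z], axis=1) @ W_dec

frames = rng.standard_normal((4, FRAME_DIM))  # 4 dummy analysis frames
f0, z = analyze(frames)
recon = synthesize(f0, z)
print(f0.shape, z.shape, recon.shape)  # (4, 1) (4, 16) (4, 80)
```

Because the F0 channel is kept separate from the latent encoding, it remains directly interpretable and manipulable, unlike a fully entangled neural representation.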