The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE for processing sequential data; these models capture not only the latent space but also the temporal dependencies within a sequence of data vectors and the corresponding latent vectors, typically relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In the present paper, we report the results of an experimental benchmark comparing six DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling.
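As an illustration of the sequential latent-variable setup described above, the following is a minimal sketch in PyTorch of a VAE with recurrent temporal modeling. The class name DVAESketch, the layer dimensions, and the simple standard-Gaussian prior over each latent vector are illustrative assumptions; this is not one of the specific DVAE variants benchmarked in the paper.

```python
# Minimal DVAE-style sketch (illustrative, not a model from the benchmark):
# an RNN encoder infers a latent vector z_t per frame x_t, and an RNN decoder
# reconstructs x_t from the latent sequence.
import torch
import torch.nn as nn

class DVAESketch(nn.Module):
    def __init__(self, x_dim=64, z_dim=16, h_dim=128):
        super().__init__()
        self.enc_rnn = nn.GRU(x_dim, h_dim, batch_first=True)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec_rnn = nn.GRU(z_dim, h_dim, batch_first=True)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        # Encoder: h_t summarizes x_{1:t}; q(z_t | x_{1:t}) = N(mu_t, diag(var_t))
        h, _ = self.enc_rnn(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # Decoder: reconstruct x_t from z_{1:t} through a second RNN
        g, _ = self.dec_rnn(z)
        x_hat = self.dec_out(g)
        # Negative ELBO with a standard Gaussian prior on each z_t (assumption)
        recon = ((x - x_hat) ** 2).sum(dim=(1, 2)).mean()
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(dim=(1, 2)).mean()
        return x_hat, recon + kl

model = DVAESketch()
x = torch.randn(8, 50, 64)   # batch of 8 sequences, 50 frames, 64 features each
x_hat, loss = model(x)       # analysis-resynthesis: encode, then reconstruct
loss.backward()
```

In an analysis-resynthesis setting, the encoder plays the role of the analysis stage (extracting a latent sequence from the speech features) and the decoder the resynthesis stage (reconstructing the features from the latents); the DVAE variants reviewed in the paper differ mainly in how the temporal dependencies between data and latent vectors are factorized and parameterized.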