Neural vocoders based on the generative adversarial network (GAN) have been widely used because they offer fast inference and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we discovered that multi-scale analysis, which focuses on the low-frequency bands, causes unintended artifacts, e.g., aliasing and imaging artifacts, that degrade the quality of the synthesized speech waveform. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based vocoders and propose a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate speech waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band speech waveforms while avoiding aliasing. According to experimental results, Avocodo outperforms baseline GAN-based vocoders, both objectively and subjectively, while reproducing speech with fewer artifacts.
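The aliasing issue mentioned above can be illustrated with a small, self-contained sketch. It is not Avocodo's method: as an assumption made for illustration, a plain Hamming-windowed sinc low-pass filter stands in for the pseudo quadrature mirror filter bank, and all function names and parameter values (`tone`, `bin_magnitude`, `lowpass_taps`, `fir_filter`, the 16 kHz rate, the 6 kHz tone) are hypothetical. The sketch shows only that a tone above the new Nyquist frequency folds into the low band when a waveform is downsampled without filtering, and that filtering first largely suppresses it.

```python
import math

def tone(freq_hz, sample_rate, num_samples):
    """Unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

def bin_magnitude(signal, freq_hz, sample_rate):
    """Normalized magnitude of a single DFT bin at freq_hz."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * n / sample_rate)
             for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * n / sample_rate)
             for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

def lowpass_taps(cutoff_ratio, num_taps=101):
    """Hamming-windowed sinc low-pass; cutoff_ratio = cutoff / sample rate."""
    mid = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2 * cutoff_ratio
        else:
            ideal = math.sin(2 * math.pi * cutoff_ratio * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)  # normalize to unity gain at DC
    return [t / gain for t in taps]

def fir_filter(signal, taps):
    """Convolution restricted to fully overlapping positions."""
    length = len(taps)
    return [sum(taps[j] * signal[i + length - 1 - j] for j in range(length))
            for i in range(len(signal) - length + 1)]

SAMPLE_RATE = 16000   # assumed input rate (Hz)
FACTOR = 2            # downsampling factor, so the new Nyquist is 4 kHz
TONE_HZ = 6000        # above the new Nyquist, so it aliases to 2 kHz
ALIAS_HZ = SAMPLE_RATE // FACTOR - TONE_HZ

x = tone(TONE_HZ, SAMPLE_RATE, 1600)

# Naive downsampling: keep every second sample without filtering.
naive = x[::FACTOR]
naive_alias = bin_magnitude(naive, ALIAS_HZ, SAMPLE_RATE // FACTOR)

# Filter below the new Nyquist first (cutoff at 3.2 kHz), then downsample.
filtered = fir_filter(x, lowpass_taps(0.2))[::FACTOR]
filtered_alias = bin_magnitude(filtered, ALIAS_HZ, SAMPLE_RATE // FACTOR)

print(f"alias magnitude without filtering: {naive_alias:.3f}")
print(f"alias magnitude with filtering:    {filtered_alias:.3f}")
```

In this toy setting the unfiltered path keeps the 6 kHz tone at essentially full strength as a spurious 2 kHz component, which is the kind of artifact a discriminator operating on naively downsampled waveforms would be exposed to.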