Generative adversarial networks (GANs) have demonstrated their superiority in real-time speech synthesis. Nevertheless, most GAN vocoders use deep convolutional layers as their backbone, which may discard information from preceding signal samples. The generation of a speech signal, however, invariably requires preceding waveform samples for its reconstruction, and the lack of this information can lead to artifacts in the generated speech. To address this conflict, we propose an improved model: a post auto-regressive (AR) GAN vocoder with a self-attention layer, which merges self-attention into an AR loop. This loop does not participate in inference, but during training it helps the generator learn temporal dependencies within frames. Furthermore, an ablation study was conducted to confirm the contribution of each component. Systematic experiments show that our model yields consistent improvements in both objective and subjective evaluations.
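To make the idea of a training-only self-attention branch inside an auto-regressive loop concrete, the following is a minimal numpy sketch. All names (`self_attention`, `generate`, the frame shapes) are hypothetical illustrations, not the paper's actual architecture; it only shows the control flow in which causally masked self-attention over previously generated frames is applied during training and skipped at inference.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over frames.

    x: array of shape (T, d) -- a sequence of T frame vectors.
    A causal mask ensures each frame attends only to itself and
    earlier frames, matching the auto-regressive setting.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    # mask out future positions (upper triangle above the diagonal)
    mask = np.triu(np.ones_like(scores), k=1)
    scores = np.where(mask == 1, -1e9, scores)
    # numerically stable softmax over the last axis
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def generate(frames, training):
    """Hypothetical AR loop: each step conditions on prior outputs.

    When training is True, the self-attention branch mixes temporal
    context into the current frame; at inference it is bypassed, so
    the attention layer adds no cost to real-time synthesis.
    """
    out = []
    for t in range(len(frames)):
        ctx = np.stack(out + [frames[t]])
        if training:
            ctx = self_attention(ctx)  # training-only branch
        out.append(ctx[-1])
    return np.stack(out)
```

Because the attention branch is bypassed at inference, `generate(x, training=False)` reduces to an identity pass in this toy sketch, illustrating why the extra layer incurs no runtime cost in deployment.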