In speech synthesis, a generative adversarial network (GAN), which trains a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. Recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) commonly use an ensemble of discriminators to scrutinize waveforms from multiple perspectives. Such discriminators allow synthesized speech to closely approach real speech; however, their model size and computation time grow with the number of discriminators. Instead, this study proposes a Wave-U-Net discriminator, a single but expressive discriminator with a Wave-U-Net architecture. This discriminator is unique in that it assesses a waveform in a sample-wise manner, at the same resolution as the input signal, while extracting multilevel features via an encoder and decoder with skip connections. This architecture provides the generator with sufficiently rich information for the synthesized speech to closely match the real speech. In the experiments, the proposed ideas were applied to a representative neural vocoder (HiFi-GAN) and an end-to-end TTS system (VITS). The results demonstrate that the proposed models achieve comparable speech quality with a discriminator that is 2.31 times faster and 14.5 times more lightweight when used in HiFi-GAN, and 1.90 times faster and 9.62 times more lightweight when used in VITS. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/.
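To make the described architecture concrete, the following is a minimal NumPy sketch of a Wave-U-Net-style discriminator: an encoder that downsamples while storing skip features, a decoder that upsamples and concatenates them, and a final 1x1 convolution that emits one score per input sample. This is an illustrative toy with random weights and assumed hyperparameters (channel count, kernel size, depth), not the authors' actual implementation; it only demonstrates that the output resolution matches the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """'Same'-padded 1-D convolution. x: (C_in, T), w: (C_out, C_in, K)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    y = np.empty((c_out, t))
    for i in range(t):
        y[:, i] = np.tensordot(w, xp[:, i:i + k], axes=([1, 2], [0, 1]))
    return y

def lrelu(x, a=0.2):
    """Leaky ReLU nonlinearity."""
    return np.where(x > 0, x, a * x)

def wave_unet_disc(x, depth=3, ch=8, k=5):
    """Toy Wave-U-Net discriminator with random weights (hypothetical sketch).
    x: (1, T) waveform, T divisible by 2**depth. Returns (T,) sample-wise scores."""
    # Encoder: conv -> store skip feature -> downsample by 2.
    h, skips = x, []
    for _ in range(depth):
        w = rng.normal(0, 0.1, (ch, h.shape[0], k))
        h = lrelu(conv1d(h, w))
        skips.append(h)
        h = h[:, ::2]                      # stride-2 downsampling
    # Decoder: upsample -> concatenate skip feature -> conv.
    for d in reversed(range(depth)):
        h = np.repeat(h, 2, axis=1)        # nearest-neighbor upsampling
        h = np.concatenate([h, skips[d]], axis=0)
        w = rng.normal(0, 0.1, (ch, h.shape[0], k))
        h = lrelu(conv1d(h, w))
    # 1x1 output conv: one real/fake score per input sample.
    w_out = rng.normal(0, 0.1, (1, ch, 1))
    return conv1d(h, w_out)[0]

scores = wave_unet_disc(rng.normal(size=(1, 64)))
print(scores.shape)  # (64,): same resolution as the input signal
```

Because the decoder mirrors the encoder and skip connections restore the original time resolution, the discriminator can provide per-sample feedback to the generator, unlike typical discriminators that output a coarse, downsampled score map.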