Previous generative adversarial network (GAN)-based neural vocoders are trained to reconstruct the exact ground-truth waveform from the paired mel-spectrogram and do not consider the one-to-many relationship of speech synthesis. This conventional training causes overfitting in both the discriminators and the generator, leading to periodicity artifacts in the generated audio signal. In this work, we present PhaseAug, the first differentiable augmentation for speech synthesis, which rotates the phase of each frequency bin to simulate one-to-many mapping. With our proposed method, we outperform baselines without any architecture modification. Code and audio samples will be available at https://github.com/mindslab-ai/phaseaug.
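To make the one-to-many idea concrete, the sketch below illustrates phase rotation in the frequency domain: each STFT frequency bin is multiplied by a random unit-magnitude complex factor, so the magnitude spectrogram (and hence the paired mel-spectrogram, which is computed from magnitudes) is unchanged while the waveform differs. This is a minimal, hypothetical illustration of the general technique using SciPy, not the paper's actual differentiable PhaseAug implementation; the function name and parameters are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def phase_rotate(x, fs=22050, nperseg=1024, rng=None):
    """Rotate the phase of each STFT frequency bin by a random angle.

    Illustrative sketch only: per-bin rotation leaves each frame's
    magnitude spectrum intact, so the same mel-spectrogram maps to a
    different waveform (the one-to-many relationship of vocoding).
    """
    rng = np.random.default_rng() if rng is None else rng
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    # One random rotation angle per frequency bin, shared across frames.
    phi = rng.uniform(-np.pi, np.pi, size=(Z.shape[0], 1))
    Z_rot = Z * np.exp(1j * phi)  # unit-magnitude factor: |Z_rot| == |Z|
    _, x_aug = istft(Z_rot, fs=fs, nperseg=nperseg)
    return x_aug[: len(x)]
```

Note that after the overlap-add inverse STFT the modified spectrogram is no longer strictly consistent, so re-analyzed magnitudes match only approximately; the differentiable formulation in the paper addresses training-time use of this idea.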