In this paper we propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training. We force the generator to approach the departure-from-normality function of real samples, computed in the spectral domain of the Schur decomposition. This binding makes the generator amenable to truncation and does not limit the exploration of all possible modes. We slightly modify the BigGAN architecture, incorporating a residual network for synthesizing 2D representations of audio signals, which enables reconstructing high-quality sounds with some phase information preserved. Additionally, the proposed conditional training scenario makes a trade-off between fidelity and variety of the generated spectrograms. Experimental results on the UrbanSound8k and ESC-50 environmental sound datasets and the Mozilla Common Voice dataset show that the proposed GAN configuration with the conditioning trick remarkably outperforms baseline architectures according to three objective metrics: inception score, Fréchet inception distance, and signal-to-noise ratio.
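For readers unfamiliar with the quantity the abstract refers to, the sketch below illustrates one standard way (Henrici's definition) to compute the departure from normality of a square matrix via its Schur decomposition, i.e. the Frobenius norm of the strictly upper-triangular part of the Schur form. This is only an illustrative assumption of the measure involved; the function name, the use of SciPy, and the treatment of the sample as a square matrix are not taken from the paper.

```python
import numpy as np
from scipy.linalg import schur


def departure_from_normality(A: np.ndarray) -> float:
    """Henrici's departure from normality of a square matrix A (illustrative sketch).

    Uses the Schur decomposition A = Q T Q^H, where T is upper triangular with
    the eigenvalues of A on its diagonal. The departure from normality is the
    Frobenius norm of the strictly upper-triangular part of T, equivalently
    sqrt(||A||_F^2 - sum_i |lambda_i|^2); it is zero iff A is normal.
    """
    T, _ = schur(A, output='complex')          # complex Schur form of A
    eigvals = np.diag(T)                        # eigenvalues of A
    residual = np.linalg.norm(A, 'fro') ** 2 - np.sum(np.abs(eigvals) ** 2)
    return float(np.sqrt(max(residual, 0.0)))  # clamp tiny negative round-off


if __name__ == "__main__":
    # A symmetric (hence normal) matrix has zero departure from normality,
    # while a generic non-symmetric matrix does not.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((8, 8))
    print(departure_from_normality(M + M.T))   # ~0
    print(departure_from_normality(M))         # > 0
```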