While generative adversarial networks (GANs) have been widely used in research on audio generation, training a GAN model is known to be unstable, time-consuming, and data-inefficient. Among the attempts to improve GAN training, Projected GAN has emerged as an effective solution for GAN-based image generation, establishing the state of the art in several image applications. Its core idea is to use a pre-trained classifier to constrain the feature space of the discriminator, thereby stabilizing and improving GAN training. This paper investigates whether Projected GAN can similarly improve audio generation by evaluating a StyleGAN2-based audio-domain loop generation model with and without a pre-trained feature space in the discriminator. Moreover, we compare using a general versus a domain-specific classifier as the pre-trained audio classifier. With experiments on both drum loop and synth loop generation, we show that a general audio classifier works better, and that with Projected GAN our loop generation models can converge around 5 times faster without performance degradation.
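The core idea of a discriminator that scores inputs in a frozen, pre-trained feature space can be illustrated with a toy sketch. This is a minimal, dependency-free Python sketch under stated assumptions, not the authors' implementation. `frozen_feature_extractor` is a hypothetical stand-in for a pre-trained audio classifier (a fixed random ReLU layer), `FeatureSpaceDiscriminatorHead` is a hypothetical trainable linear head, and the hinge loss is an assumption made here for simplicity. The sketch also omits the fixed random projections and multi-scale discriminators of the original Projected GAN, as well as the generator update, whose gradients would flow back through the frozen feature network.

```python
import math
import random
from typing import Callable, List

Vector = List[float]


def frozen_feature_extractor(in_dim: int, feat_dim: int, seed: int = 0) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a pre-trained audio classifier: one fixed
    random ReLU layer. Its weights are drawn once and never updated."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
               for _ in range(feat_dim)]

    def extract(x: Vector) -> Vector:
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return extract


class FeatureSpaceDiscriminatorHead:
    """Hypothetical trainable linear head that scores frozen features instead of raw inputs."""

    def __init__(self, feat_dim: int) -> None:
        self.w = [0.0] * feat_dim
        self.b = 0.0

    def logit(self, feats: Vector) -> float:
        return sum(w * f for w, f in zip(self.w, feats)) + self.b

    def hinge_step(self, real_feats: Vector, fake_feats: Vector, lr: float = 0.1) -> float:
        """One SGD step on max(0, 1 - D(real)) + max(0, 1 + D(fake)) (assumed loss).
        Only the head's parameters change; the feature extractor stays frozen."""
        d_real = self.logit(real_feats)
        d_fake = self.logit(fake_feats)
        loss = max(0.0, 1.0 - d_real) + max(0.0, 1.0 + d_fake)
        grad_w = [0.0] * len(self.w)
        grad_b = 0.0
        if d_real < 1.0:  # real-sample term is active; its gradient w.r.t. D is -1
            grad_w = [g - f for g, f in zip(grad_w, real_feats)]
            grad_b -= 1.0
        if d_fake > -1.0:  # fake-sample term is active; its gradient w.r.t. D is +1
            grad_w = [g + f for g, f in zip(grad_w, fake_feats)]
            grad_b += 1.0
        self.w = [w - lr * g for w, g in zip(self.w, grad_w)]
        self.b -= lr * grad_b
        return loss


# Toy usage: 8-dimensional "clips" mapped to 16 frozen features.
extract = frozen_feature_extractor(in_dim=8, feat_dim=16)
head = FeatureSpaceDiscriminatorHead(feat_dim=16)
rng = random.Random(1)
real = [rng.gauss(1.0, 0.1) for _ in range(8)]
fake = [rng.gauss(-1.0, 0.1) for _ in range(8)]
losses = [head.hinge_step(extract(real), extract(fake)) for _ in range(50)]
print(f"first loss {losses[0]:.2f}, last loss {losses[-1]:.2f}")
```

Because the extractor's weights are never touched, training only shapes the head; the discriminator can separate real from fake only along directions the frozen feature space already exposes, which is the sense in which the pre-trained classifier constrains it.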