Following practice in computer vision, Generative Adversarial Networks (GANs) are often adapted to the audio domain by treating fixed-size two-dimensional spectrogram representations as the "image data". In the (musical) audio domain, however, it is often desirable to generate output of variable duration. This paper presents VQCPC-GAN, an adversarial framework for synthesizing variable-length audio by exploiting Vector-Quantized Contrastive Predictive Coding (VQCPC). A sequence of VQCPC tokens extracted from real audio data serves as conditional input to a GAN architecture, providing step-wise, time-dependent features of the generated content. The input noise z (characteristic of adversarial architectures) remains fixed over time, ensuring temporal consistency of global features. We evaluate the proposed model by comparing it against several strong baselines on a diverse set of metrics. Results show that, although the baselines score best, VQCPC-GAN achieves comparable performance, even when generating variable-length audio. Numerous sound examples are provided on the accompanying website, and we release the code for reproducibility.
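The conditioning scheme described in the abstract can be illustrated with a minimal sketch: one token per time step carries the time-dependent features, while a single noise vector z is sampled once and repeated at every step. This is not taken from the released code. The function names, the one-hot token encoding, and the sizes `CODEBOOK_SIZE` and `NOISE_DIM` are assumptions made for illustration only; the actual model may embed tokens differently.

```python
import random


def one_hot(index, size):
    """Return a one-hot list of length `size` with 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"token {index} outside codebook of size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_conditioning(tokens, z, codebook_size):
    """Pair each time step's token with the same global noise vector.

    tokens: list of int token indices, one per time step (any length).
    z: list of floats, sampled once and reused unchanged at every step.
    Returns one input vector per time step, laid out as [one-hot token | z].
    """
    return [one_hot(t, codebook_size) + list(z) for t in tokens]


# Hypothetical sizes, chosen only to keep the example small.
CODEBOOK_SIZE = 16
NOISE_DIM = 4

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]  # sampled once

# Two token sequences of different length share the same z.
short = build_conditioning([3, 3, 7], z, CODEBOOK_SIZE)
long = build_conditioning([3, 3, 7, 7, 1, 0, 12, 5], z, CODEBOOK_SIZE)
```

In this sketch the output length is set entirely by the length of the token sequence, while the trailing `NOISE_DIM` entries are identical at every step, which is the sense in which z stays fixed over time.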