Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, which alleviates data scarcity by composing orders of magnitude more concept combinations from language-free audio; and 2) leveraging a spectrogram autoencoder to predict self-supervised audio representations instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluations. Moreover, we present its controllability and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audio given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io
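To make the described pipeline concrete, below is a minimal sketch of the data flow the abstract implies: a CLAP text embedding conditions a latent diffusion model over spectrogram latents, the spectrogram autoencoder's decoder maps the sampled latent back to a mel-spectrogram, and a vocoder produces the waveform. All module names, shapes, and internals here are hypothetical stand-ins for illustration, not the authors' implementation or released API.

```python
# Hypothetical sketch of the Make-An-Audio inference flow: text -> CLAP embedding
# -> latent diffusion -> spectrogram decode -> vocoder. Modules are toy stand-ins.
import torch
import torch.nn as nn

class TextEncoderCLAP(nn.Module):
    """Stand-in for a frozen CLAP text encoder producing a conditioning vector."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(768, dim)            # pretend 768-d token features
    def forward(self, token_feats):                # (B, T, 768)
        return self.proj(token_feats).mean(dim=1)  # pooled (B, dim) embedding

class LatentDiffusion(nn.Module):
    """Stand-in denoiser that iteratively refines a spectrogram latent."""
    def __init__(self, latent_shape=(8, 16, 64), cond_dim=512):
        super().__init__()
        self.latent_shape = latent_shape
        c = latent_shape[0]
        self.cond = nn.Linear(cond_dim, c)
        self.net = nn.Conv2d(c, c, kernel_size=3, padding=1)
    @torch.no_grad()
    def sample(self, cond, steps=25):
        z = torch.randn(cond.size(0), *self.latent_shape)
        for _ in range(steps):                     # toy denoising loop
            z = z - 0.1 * self.net(z + self.cond(cond)[:, :, None, None])
        return z

class SpectrogramDecoder(nn.Module):
    """Stand-in for the spectrogram autoencoder's decoder (latent -> mel)."""
    def __init__(self, in_ch=8):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, 1, kernel_size=4, stride=4)
    def forward(self, z):
        return self.up(z).squeeze(1)               # (B, n_mels, frames)

class Vocoder(nn.Module):
    """Stand-in vocoder (mel -> waveform), e.g. a HiFi-GAN-style module."""
    def __init__(self, n_mels=64, hop=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hop)         # one hop of samples per frame
    def forward(self, mel):                        # mel: (B, n_mels, frames)
        frames = self.proj(mel.transpose(1, 2))    # (B, frames, hop)
        return frames.flatten(1)                   # (B, frames * hop) waveform

# End-to-end data flow with dummy token features standing in for a text prompt.
text_feats = torch.randn(1, 16, 768)
cond = TextEncoderCLAP()(text_feats)
z = LatentDiffusion().sample(cond)
mel = SpectrogramDecoder()(z)
wav = Vocoder()(mel)
print(mel.shape, wav.shape)
```

The sketch only illustrates why predicting a compact spectrogram latent (rather than a raw waveform) shortens the sequence the diffusion model must handle; the real system's encoders, denoiser, and vocoder are far larger pretrained networks.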