We describe our approach to the generative emotional vocal burst task (ExVo Generate) of the ICML Expressive Vocalizations Competition. We train a conditional StyleGAN2 architecture on mel-spectrograms of preprocessed versions of the audio samples; the mel-spectrograms generated by the model are then inverted back to the audio domain. Our generated samples substantially improve upon the competition baseline, both qualitatively and quantitatively, for all emotions. More precisely, even for our worst-performing emotion (awe), we obtain a Fréchet Audio Distance (FAD) of 1.76 compared to the baseline of 4.81 (for reference, the FAD between the train and validation sets for awe is 0.776).
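As a minimal sketch of the mel-spectrogram round-trip described above, the snippet below uses librosa to convert audio to a log-mel-spectrogram and back. The STFT/mel parameters are illustrative assumptions, and Griffin-Lim is used as a stand-in inversion method; the actual preprocessing settings and vocoder are not specified in this abstract.

```python
# Illustrative mel-spectrogram extraction and inversion with librosa.
# All parameter values below are assumptions, not the paper's settings.
import librosa
import numpy as np

def audio_to_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=128):
    """Load a vocal burst and convert it to a log-mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)

def mel_to_audio(log_mel, sr=16000, n_fft=1024, hop_length=256):
    """Invert a log-mel-spectrogram back to a waveform (Griffin-Lim)."""
    mel = librosa.db_to_power(log_mel)
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length
    )
```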
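For context on the reported metric, FAD is the Fréchet distance between Gaussians fitted to embeddings of real and generated audio. Below is a minimal sketch over precomputed embeddings; the standard metric uses VGGish embeddings, whose extraction is assumed to have happened already.

```python
# Fréchet Audio Distance between two sets of precomputed audio embeddings,
# each of shape (num_clips, embedding_dim). Embedding extraction is assumed.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to the two embedding sets."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical-noise imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```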