Image-to-image translation and voice conversion enable the generation of a new facial image and a new voice while preserving certain semantics, such as the pose in an image and the linguistic content in audio, respectively. They can aid the content-creation process in many applications. However, as they are limited to conversion within each modality, matching the impression of the generated face and voice remains an open problem. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks within a single framework: image translation and voice conversion with either image or audio guidance. The cross-modal guidance enables the generation of a ``face that matches a given voice'' and a ``voice that matches a given face'', alongside the intra-modality translation tasks. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.
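To make the four-task setup concrete, the following is a minimal sketch of how a shared style space can drive both an image generator and a voice generator, so that either modality can supply the guidance. It assumes a StarGANv2-like design; all module names, shapes, and the training loop here (StyleEncoder, Generator, the mel-spectrogram sizes) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    """Maps a guidance input (face image or voice mel-spectrogram) to a style code."""
    def __init__(self, in_channels: int, style_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, style_dim),
        )

    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """Translates a source (image or mel-spectrogram) conditioned on a style code."""
    def __init__(self, in_channels: int, style_dim: int = 64):
        super().__init__()
        self.encode = nn.Conv2d(in_channels, 64, 3, padding=1)
        self.style_proj = nn.Linear(style_dim, 64)
        self.decode = nn.Conv2d(64, in_channels, 3, padding=1)

    def forward(self, x, style):
        # Inject the style code as a channel-wise bias on the encoded features.
        h = self.encode(x) + self.style_proj(style)[:, :, None, None]
        return self.decode(torch.relu(h))

# A shared style space lets either encoder drive either generator, yielding the
# four tasks: image-guided image translation, audio-guided voice conversion
# (intra-modality), plus voice-guided image translation and image-guided voice
# conversion (cross-modal).
img_style_enc, aud_style_enc = StyleEncoder(3), StyleEncoder(1)
img_gen, aud_gen = Generator(3), Generator(1)

face = torch.randn(2, 3, 64, 64)   # batch of face images
mel = torch.randn(2, 1, 80, 96)    # batch of voice mel-spectrograms

for src, gen in ((face, img_gen), (mel, aud_gen)):
    for guide, enc in ((face, img_style_enc), (mel, aud_style_enc)):
        out = gen(src, enc(guide))  # one of the four translation tasks
        assert out.shape == src.shape
```

Because both style encoders map into the same style dimension, the same generator can consume guidance from either modality; in a full system, adversarial, style-reconstruction, and cycle-consistency losses over all four task combinations would tie the two modalities together.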