Multiple previous studies have shown a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces from voice alone, without exploring the set of features that contribute to these observed correlations. A computational methodology for exploring this can be devised by rephrasing the question as: "how much would a target face have to change in order to be perceived as the originator of a source voice?" With this perspective, in this paper we propose a framework that morphs a target face in response to a given voice, such that the facial features are implicitly guided by the learned voice-face correlations. Our framework comprises a guided autoencoder that converts one face into another, controlled by a unique model-conditioning component, called a gating controller, which modifies the reconstructed face based on input voice recordings. We evaluate the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval experiments. Various experiments demonstrate the effectiveness of our proposed model.
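To make the described architecture concrete, below is a minimal PyTorch sketch of a face autoencoder whose latent code is modulated by a gating controller driven by a voice embedding. All module names, layer sizes, and the sigmoid-gate formulation are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class GatingController(nn.Module):
    """Maps a voice embedding to per-dimension gates in [0, 1] (assumed formulation)."""
    def __init__(self, voice_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(voice_dim, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim), nn.Sigmoid(),
        )

    def forward(self, voice_emb: torch.Tensor) -> torch.Tensor:
        return self.net(voice_emb)

class GuidedAutoencoder(nn.Module):
    """Encodes a target face, gates the latent code with the voice, and decodes the morphed face."""
    def __init__(self, face_dim: int = 3 * 64 * 64, latent_dim: int = 256, voice_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(face_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, face_dim), nn.Tanh())
        self.gate = GatingController(voice_dim, latent_dim)

    def forward(self, face: torch.Tensor, voice_emb: torch.Tensor) -> torch.Tensor:
        z = self.encoder(face.flatten(1))      # face latent code
        z = z * self.gate(voice_emb)           # voice-conditioned gating of the latent code
        return self.decoder(z).view_as(face)   # reconstructed (morphed) face

# Usage: morph a batch of 64x64 RGB target faces given 128-d voice embeddings
# (dimensions are placeholders for illustration only).
model = GuidedAutoencoder()
faces = torch.randn(4, 3, 64, 64)
voices = torch.randn(4, 128)
morphed = model(faces, voices)                 # shape: (4, 3, 64, 64)
```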