This work analyzes whether 3D face models can be learned from the speech of speakers alone. Previous works on cross-modal face synthesis study image generation from voices. However, image synthesis involves variations such as hairstyles, backgrounds, and facial textures that are arguably irrelevant to voice, or for which no direct studies show a correlation. We instead investigate the ability to reconstruct 3D faces, concentrating only on geometry, which is more physiologically grounded. We propose both supervised and unsupervised learning frameworks. In particular, we demonstrate how unsupervised learning is possible in the absence of a direct voice-to-3D-face dataset and under limited availability of 3D face scans, when the model is equipped with knowledge distillation. For evaluation, we also propose several metrics that measure the geometric fitness of two 3D faces based on points, lines, and regions. Experimental results suggest that 3D face shapes can indeed be reconstructed from voices, and that our method improves performance over the baseline. The best performance gains (15%-20%) appear on the ear-to-ear distance ratio metric (ER), which coincides with the intuition that one can roughly envision from a person's voice alone whether a speaker's face is overall wider or thinner. See our project page for code and data.
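The abstract does not define the ER metric precisely, but a minimal point-based sketch of an ear-to-ear distance ratio comparison might look as follows. The landmark indices and the relative-error formulation are assumptions for illustration, not the paper's actual definition:

```python
import numpy as np

def ear_to_ear_ratio_error(pred_landmarks, gt_landmarks,
                           left_ear_idx=0, right_ear_idx=1):
    """Relative error between the ear-to-ear widths of a predicted and a
    ground-truth 3D face, each given as an (N, 3) array of landmarks.

    The landmark indices are hypothetical placeholders; a real pipeline
    would use the ear landmarks of its specific 3D face template.
    """
    pred_width = np.linalg.norm(pred_landmarks[left_ear_idx] -
                                pred_landmarks[right_ear_idx])
    gt_width = np.linalg.norm(gt_landmarks[left_ear_idx] -
                              gt_landmarks[right_ear_idx])
    # Relative deviation of the predicted face width from the true width.
    return abs(pred_width - gt_width) / gt_width

# Toy example: the predicted face is 10% wider than the ground truth.
pred = np.array([[-1.1, 0.0, 0.0], [1.1, 0.0, 0.0]])
gt = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
err = ear_to_ear_ratio_error(pred, gt)  # → 0.1
```

A ratio-based comparison like this is insensitive to the absolute scale difference between two face meshes, which matters when reconstructions are only defined up to scale.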