We present a conditional estimation (CEST) framework that learns 3D facial parameters from single-view 2D images via self-supervised training on videos. CEST follows the analysis-by-synthesis paradigm: the 3D facial parameters (shape, reflectance, viewpoint, and illumination) are estimated from a face image and then recombined to reconstruct the 2D face image. To learn semantically meaningful 3D facial parameters without explicit access to their labels, CEST couples the estimation of the different parameters by taking their statistical dependencies into account. Specifically, the estimation of each 3D facial parameter is conditioned not only on the given image but also on the parameters that have already been derived. Moreover, reflectance symmetry and consistency among video frames are exploited to improve the disentanglement of the facial parameters. Together with a novel strategy for incorporating reflectance symmetry and consistency, CEST can be trained efficiently on in-the-wild video clips. Both qualitative and quantitative experiments demonstrate the effectiveness of CEST.
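The conditional-estimation idea described above can be sketched as a chain of estimators, each consuming the image together with all previously derived parameters. This is a minimal illustrative sketch, not the paper's implementation: the parameter ordering (shape → reflectance → illumination → viewpoint), the feature dimensions, and the use of tiny random linear "encoders" in place of trained networks are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained encoder: a fixed random linear map
# followed by tanh, used only to show the conditioning structure.
def make_encoder(in_dim, out_dim, rng):
    W = rng.standard_normal((out_dim, in_dim)) * 0.01
    return lambda x: np.tanh(W @ x)

# Illustrative dimensions (not from the paper).
D_IMG, D_SHAPE, D_REFL, D_ILLUM, D_VIEW = 64, 16, 16, 9, 3

f_shape = make_encoder(D_IMG, D_SHAPE, rng)
f_refl  = make_encoder(D_IMG + D_SHAPE, D_REFL, rng)
f_illum = make_encoder(D_IMG + D_SHAPE + D_REFL, D_ILLUM, rng)
f_view  = make_encoder(D_IMG + D_SHAPE + D_REFL + D_ILLUM, D_VIEW, rng)

def conditional_estimate(image_feat):
    # Each parameter is conditioned on the image AND on the
    # parameters that have already been derived.
    shape = f_shape(image_feat)
    refl  = f_refl(np.concatenate([image_feat, shape]))
    illum = f_illum(np.concatenate([image_feat, shape, refl]))
    view  = f_view(np.concatenate([image_feat, shape, refl, illum]))
    return shape, refl, illum, view

img = rng.standard_normal(D_IMG)
shape, refl, illum, view = conditional_estimate(img)
print(shape.shape, refl.shape, illum.shape, view.shape)
```

In a full analysis-by-synthesis pipeline, the four estimated parameter vectors would then be passed to a differentiable renderer to reconstruct the input image, and the reconstruction error (together with the symmetry and consistency terms) would drive self-supervised training.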