Monocular image-based 3D reconstruction of faces is a long-standing problem in computer vision. Since image data is a 2D projection of a 3D face, the resulting depth ambiguity makes the problem ill-posed. Most existing methods rely on data-driven priors built from a limited number of 3D face scans. In contrast, we propose multi-frame, video-based self-supervised training of a deep network that (i) learns a face identity model in both shape and appearance while (ii) jointly learning to reconstruct 3D faces. Our face model is learned using only corpora of in-the-wild video clips collected from the Internet. This virtually endless source of training data enables learning of a highly general 3D face model. To achieve this, we propose a novel multi-frame consistency loss that enforces consistent shape and appearance across multiple frames of a subject's face, thus minimizing depth ambiguity. At test time, we can use an arbitrary number of frames, enabling both monocular and multi-frame reconstruction.
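To make the multi-frame consistency idea concrete, here is a minimal sketch of one plausible form such a loss could take: penalizing the deviation of per-frame identity predictions (shape and appearance codes) from their mean across frames of the same subject. The function name, code layout, and the choice of a mean-squared-deviation penalty are illustrative assumptions, not the paper's exact formulation.

```python
def multiframe_consistency_loss(identity_codes):
    """Hypothetical multi-frame consistency loss.

    identity_codes: list of F per-frame identity code vectors (each of
    length D), predicted from F frames of the same subject. Since the
    true identity is shared across frames, we penalize the mean squared
    deviation of each per-frame code from the across-frame mean code.
    """
    num_frames = len(identity_codes)
    dim = len(identity_codes[0])
    # Across-frame mean of the identity codes, dimension by dimension.
    mean_code = [sum(code[d] for code in identity_codes) / num_frames
                 for d in range(dim)]
    # Average squared deviation from the mean over all frames and dims.
    total = sum((code[d] - mean_code[d]) ** 2
                for code in identity_codes for d in range(dim))
    return total / (num_frames * dim)
```

When all frames predict the same identity code the loss is zero, so minimizing it pushes the network toward a single, frame-independent face identity while per-frame factors (pose, expression, lighting) remain free to vary.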