Production-level workflows for producing convincing 3D dynamic human faces have long relied on an assortment of labor-intensive tools for geometry and texture generation, motion capture and rigging, and expression synthesis. Recent neural approaches automate individual components, but the corresponding latent representations do not provide artists with the explicit controls available in conventional tools. In this paper, we present a new learning-based, video-driven approach for generating dynamic facial geometries with high-quality physically-based assets. Two key components underpin our approach: well-structured latent spaces learned from dense temporal sampling of videos, and explicit facial expression controls that regulate these latent spaces. For data collection, we construct a hybrid multiview-photometric capture stage, coupled with an ultra-fast video camera, to obtain raw 3D facial assets. We then model facial expression, geometry, and physically-based textures with separate VAEs, connected by a global MLP-based expression mapping across their latent spaces, which preserves the characteristics of each attribute while maintaining explicit controls over geometry and texture. We further model the delta information of the physically-based textures as wrinkle maps, enabling high-quality rendering of dynamic textures. We demonstrate our approach in high-fidelity performer-specific facial capture and cross-identity facial motion retargeting. In addition, our neural asset, together with fast adaptation schemes, can be deployed to handle in-the-wild videos. We further demonstrate the utility of our explicit facial disentanglement strategy through highly realistic physically-based editing results, such as geometry and material editing and wrinkle transfer. Comprehensive experiments show that our technique provides higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
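To make the disentangled design concrete, the following is a minimal sketch of the separate-VAE architecture with a global MLP-based expression mapping, written in PyTorch-style Python. All module names, latent dimensionalities, the blendshape-like expression input, and the additive composition of the wrinkle map with a neutral texture are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Generic VAE over one flattened facial attribute: expression
    parameters, geometry vertex offsets, or a physically-based texture map."""
    def __init__(self, in_dim: int, latent_dim: int, hidden: int = 512):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, in_dim))

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar

    def forward(self, x):
        z, mu, logvar = self.encode(x)
        return self.dec(z), mu, logvar

class ExpressionMapping(nn.Module):
    """Global MLP mapping the explicit expression latent code into the
    geometry and texture latent spaces, so one expression code drives both."""
    def __init__(self, z_expr: int, z_geom: int, z_tex: int, hidden: int = 256):
        super().__init__()
        self.to_geom = nn.Sequential(nn.Linear(z_expr, hidden), nn.ReLU(),
                                     nn.Linear(hidden, z_geom))
        self.to_tex = nn.Sequential(nn.Linear(z_expr, hidden), nn.ReLU(),
                                    nn.Linear(hidden, z_tex))

    def forward(self, z_expr):
        return self.to_geom(z_expr), self.to_tex(z_expr)

# Illustrative per-frame drive step: an expression code tracked from video
# drives dynamic geometry and a wrinkle (delta) map that is composited onto
# a static neutral physically-based texture. All sizes are placeholders.
expr_vae = VAE(in_dim=52, latent_dim=16)          # blendshape-like coefficients
geom_vae = VAE(in_dim=3 * 5000, latent_dim=64)    # flattened vertex offsets
tex_vae = VAE(in_dim=3 * 256 * 256, latent_dim=128)  # flattened wrinkle map
mapping = ExpressionMapping(16, 64, 128)

expr = torch.rand(1, 52)                  # tracked expression for one frame
z_e, _, _ = expr_vae.encode(expr)
z_g, z_t = mapping(z_e)                   # cross-latent expression mapping
geometry = geom_vae.dec(z_g)              # dynamic facial geometry
wrinkle_delta = tex_vae.dec(z_t)          # dynamic texture residual
neutral_tex = torch.rand(1, 3 * 256 * 256)   # static neutral texture (placeholder)
dynamic_tex = neutral_tex + wrinkle_delta    # delta composition, as assumed here
```

In the actual pipeline the texture branch would operate on UV-space physically-based maps and the geometry branch on mesh vertices; this sketch only illustrates how a single explicit expression code can drive both latent spaces through a shared mapping while each attribute keeps its own VAE.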