We present FlexAvatar, a flexible large reconstruction model that recovers high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchors, aggregating an arbitrary number of camera-pose-free, expression-label-free input images into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which produces detailed expression-dependent deformations in real time. To better capture rare but critical expression details such as wrinkles and bared teeth, we adopt a data distribution adjustment strategy during training that rebalances these expressions in the training set. Moreover, a lightweight 10-second refinement further enhances identity-specific details for extreme identities without degrading deformation quality. Extensive experiments demonstrate that FlexAvatar achieves superior 3D consistency and detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
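To make the deformation branch concrete, below is a minimal PyTorch sketch of a UV-conditioned UNet deformation decoder. All names (`UVDeformationUNet`, `expr_dim`, `base`) are hypothetical illustrations: the abstract specifies only that a lightweight UNet conditioned on UV-space position maps outputs expression-dependent deformations in real time, so the network depth, the expression-code conditioning, and the per-texel offset output shown here are assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a UV-conditioned deformation decoder (PyTorch).
# Assumed interface: a UV-space position map (canonical xyz rasterized to UV)
# plus a per-frame expression code -> per-texel 3D deformation offsets.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.GroupNorm(8, out_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.net(x)


class UVDeformationUNet(nn.Module):
    """Two-level UNet: UV position map + broadcast expression code -> xyz offsets."""

    def __init__(self, expr_dim=64, base=32):
        super().__init__()
        in_ch = 3 + expr_dim                                   # xyz map + expression code
        self.enc1 = ConvBlock(in_ch, base)
        self.down = nn.Conv2d(base, base * 2, 4, stride=2, padding=1)
        self.enc2 = ConvBlock(base * 2, base * 2)
        self.up = nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1)
        self.dec1 = ConvBlock(base * 2, base)                  # skip-connected decoder stage
        self.head = nn.Conv2d(base, 3, 1)                      # per-texel xyz offset

    def forward(self, pos_map, expr_code):
        # pos_map:   (B, 3, H, W) canonical positions rasterized into UV space
        # expr_code: (B, expr_dim) expression embedding, broadcast over the UV grid
        B, _, H, W = pos_map.shape
        e = expr_code[:, :, None, None].expand(B, -1, H, W)
        x1 = self.enc1(torch.cat([pos_map, e], dim=1))
        x2 = self.enc2(self.down(x1))
        y = self.dec1(torch.cat([self.up(x2), x1], dim=1))
        return self.head(y)                                    # (B, 3, H, W) deformation map


# Usage: the offset map would be sampled back onto the canonical 3D
# representation each frame to produce expression-dependent deformation.
net = UVDeformationUNet()
offsets = net(torch.randn(1, 3, 256, 256), torch.randn(1, 64))
```

A shallow UNet of this kind keeps the per-frame cost to a handful of small convolutions, which is consistent with the real-time claim; the heavy transformer runs once per identity to build the canonical representation, while only this decoder runs per expression.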