Previous text-to-4D methods have leveraged multiple Score Distillation Sampling (SDS) techniques, combining motion priors from video-based diffusion models (DMs) with geometric priors from multiview DMs to implicitly guide 4D renderings. However, differences in these priors result in conflicting gradient directions during optimization, causing trade-offs between motion fidelity and geometry accuracy, and requiring substantial optimization time to reconcile the models. In this paper, we introduce \textbf{P}ixel-\textbf{L}evel \textbf{A}lignment for text-driven \textbf{4D} Gaussian splatting (PLA4D) to resolve this motion-geometry conflict. PLA4D provides an anchor reference, i.e., text-generated video, to align the rendering process conditioned by different DMs in pixel space. For static alignment, our approach introduces a focal alignment method and Gaussian-Mesh contrastive learning to iteratively adjust focal lengths and provide explicit geometric priors at each timestep. At the dynamic level, a motion alignment technique and T-MV refinement method are employed to enforce both pose alignment and motion continuity across unknown viewpoints, ensuring intrinsic geometric consistency across views. With such pixel-level multi-DM alignment, our PLA4D framework is able to generate 4D objects with superior geometric, motion, and semantic consistency. Fully implemented with open-source tools, PLA4D offers an efficient and accessible solution for high-quality 4D digital content creation with significantly reduced generation time.
翻译:暂无翻译