Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to solve this task efficiently. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer the reader to the project webpage for animated results: https://d4rt-paper.github.io/.
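To make the querying mechanism concrete, the sketch below illustrates one plausible reading of such a decoding interface: the video is encoded once into a shared token set, and each space-time probe is an independent query that cross-attends to those tokens and regresses a 3D point. This is a minimal illustration under assumed names, shapes, and query parameterization (a `(u, v, t_src, t_tgt)` probe, a `QueryDecoder` module); it is not the paper's actual D4RT architecture.

```python
# Illustrative sketch only: all names, shapes, and the query format are
# assumptions, not the published D4RT implementation.
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Cross-attends a per-point query against precomputed video tokens
    and regresses a 3D position, avoiding dense per-frame decoding."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Query = (u, v, t_src, t_tgt): a pixel at a source time,
        # probed for its 3D position at a target time.
        self.query_embed = nn.Linear(4, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.to_xyz = nn.Linear(dim, 3)  # 3D point in a shared world frame

    def forward(self, video_tokens: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim) -- encoder output for the whole clip
        # queries:      (B, Q, 4)   -- any number of independent probes
        q = self.query_embed(queries)
        attended, _ = self.cross_attn(q, video_tokens, video_tokens)
        return self.to_xyz(attended)  # (B, Q, 3)

# Usage: encode the video once, then probe arbitrary space-time points cheaply,
# instead of decoding dense maps for every frame.
decoder = QueryDecoder(dim=256)
tokens = torch.randn(1, 1024, 256)   # stand-in for unified transformer features
queries = torch.rand(1, 5, 4)        # five independent (u, v, t_src, t_tgt) probes
points = decoder(tokens, queries)    # -> (1, 5, 3)
```

Under this reading, the cost of decoding scales with the number of probes rather than with frame count and resolution, which is consistent with the efficiency claim above.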