Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi SM Sajjadi

Source: arXiv. Project page: https://d4rt-paper.github.io/

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.

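To make the querying idea concrete, here is a minimal NumPy sketch of a query-based decoder. It is not the authors' implementation: the feature width `D`, the `(u, v, t)` query layout, the single-head cross-attention, and all weight matrices (`W_query`, `W_key`, `W_value`, `W_out`) are assumptions chosen for illustration, and the "scene tokens" are random stand-ins for an encoder's output. The only property it is meant to show is the one the abstract describes: each query reads from a shared video representation and is decoded independently of every other query.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical feature width

# Stand-in for the encoder output: one token per video patch (random here).
scene_tokens = rng.normal(size=(64, D))

# Hypothetical learned weights, randomly initialised for illustration only.
W_query = rng.normal(size=(3, D))  # embeds a (u, v, t) query
W_key = rng.normal(size=(D, D))
W_value = rng.normal(size=(D, D))
W_out = rng.normal(size=(D, 3))    # maps attended features to (x, y, z)


def decode_points(queries, tokens):
    """Cross-attend each (u, v, t) query to the scene tokens and predict a 3D point."""
    q = queries @ W_query                         # (N, D)
    k = tokens @ W_key                            # (T, D)
    v = tokens @ W_value                          # (T, D)
    logits = q @ k.T / np.sqrt(D)                 # (N, T)
    logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ W_out                  # (N, 3)


# Five arbitrary queries: pixel coordinates (u, v) and a timestamp t in [0, 1].
queries = rng.uniform(size=(5, 3))

# Decoding all queries at once or one at a time gives the same answer,
# because queries only attend to the scene tokens, never to each other.
batched = decode_points(queries, scene_tokens)
one_by_one = np.vstack([decode_points(q[None, :], scene_tokens) for q in queries])
```

In a design like this, the number of queries can be chosen at inference time, so a sparse set of tracked points and a dense depth map would use the same decoder and differ only in how many queries are issued.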