This paper investigates vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that rely on custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns the temporal and spatial information needed for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone already produces satisfactory results, and that iterative refinement further elevates performance to state-of-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation, with end-point errors (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally sets a new record on the online test benchmarks, with EPE values of 0.79 and 1.88 and an F1 value of 3.79. Applications to depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models for geometric vision tasks.
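To make the described pipeline concrete, the following is a minimal sketch, not the paper's actual architecture: a generic nn.TransformerEncoder stands in for the video-pretrained backbone, patches of both frames are attended jointly, a single linear layer decodes per-patch flow, and the prediction is accumulated over a few refinement iterations. All class names, dimensions, and the patch size here are illustrative assumptions.

```python
# Hypothetical sketch of "video backbone + linear decoder + iterative refinement".
# The real model fine-tunes a video foundation model; here a small Transformer
# encoder is used as a placeholder so the example runs standalone.
import torch
import torch.nn as nn


class LinearFlowHead(nn.Module):
    """Map per-patch tokens to a 2-channel flow field with one linear layer."""

    def __init__(self, dim: int, patch: int = 16):
        super().__init__()
        self.patch = patch
        # Each token predicts a (patch x patch x 2) flow tile.
        self.decoder = nn.Linear(dim, patch * patch * 2)

    def forward(self, tokens: torch.Tensor, hw: tuple) -> torch.Tensor:
        h, w = hw                                    # patch-grid size
        b = tokens.shape[0]
        flow = self.decoder(tokens)                  # (B, h*w, patch*patch*2)
        flow = flow.view(b, h, w, self.patch, self.patch, 2)
        # Rearrange tiles back into a dense (B, 2, H, W) flow map.
        flow = flow.permute(0, 5, 1, 3, 2, 4).reshape(
            b, 2, h * self.patch, w * self.patch)
        return flow


class VideoBackboneFlow(nn.Module):
    """Two frames -> joint patch tokens -> linear flow decoder, refined iteratively."""

    def __init__(self, dim: int = 384, patch: int = 16, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Placeholder backbone: attention runs jointly over the patches of both
        # frames, so it can mix spatial and temporal (cross-frame) information.
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = LinearFlowHead(dim, patch)

    def forward(self, frame1: torch.Tensor, frame2: torch.Tensor) -> torch.Tensor:
        t1, t2 = self.embed(frame1), self.embed(frame2)       # (B, C, h, w)
        h, w = t1.shape[-2:]
        tokens = torch.cat([t1.flatten(2), t2.flatten(2)], dim=2).transpose(1, 2)
        flow = None
        for _ in range(self.iters):                            # iterative refinement
            tokens = self.backbone(tokens)
            delta = self.head(tokens[:, h * w:], (h, w))       # decode second-frame tokens
            flow = delta if flow is None else flow + delta
        return flow


if __name__ == "__main__":
    model = VideoBackboneFlow()
    f1, f2 = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
    print(model(f1, f2).shape)  # torch.Size([1, 2, 224, 224])
```

In this sketch the refinement loop simply re-applies the backbone and accumulates flow residuals; the paper's refinement scheme may differ, but the structure illustrates how a plain linear decoder on top of general-purpose patch attention can already yield a dense flow prediction.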