In this paper, we study the problem of jointly estimating optical flow and scene flow from synchronized 2D and 3D data. Previous methods either employ a complex pipeline that splits the joint task into independent stages, or fuse 2D and 3D information in an ``early-fusion'' or ``late-fusion'' manner. Such one-size-fits-all approaches suffer from a dilemma: they fail either to fully utilize the characteristics of each modality or to maximize inter-modality complementarity. To address this problem, we propose a novel end-to-end framework that consists of 2D and 3D branches with multiple bidirectional fusion connections between them at specific layers. Different from previous work, we apply a point-based 3D branch to extract LiDAR features, as it preserves the geometric structure of point clouds. To fuse dense image features and sparse point features, we propose a learnable operator named the bidirectional camera-LiDAR fusion module (Bi-CLFM). We instantiate two types of bidirectional fusion pipelines: one based on the pyramidal coarse-to-fine architecture (dubbed CamLiPWC), and the other based on recurrent all-pairs field transforms (dubbed CamLiRAFT). On FlyingThings3D, both CamLiPWC and CamLiRAFT surpass all existing methods and reduce the 3D end-point error by up to 47.9\% from the best published result. Our best-performing model, CamLiRAFT, achieves an error of 4.26\% on the KITTI Scene Flow benchmark, ranking first among all submissions with far fewer parameters. Moreover, our methods generalize well and can handle non-rigid motion. Code is available at https://github.com/MCG-NJU/CamLiFlow.
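To make the bidirectional fusion idea concrete, below is a minimal PyTorch sketch of a Bi-CLFM-style module. It is an illustrative assumption, not the authors' implementation: the class name `BiCLFMSketch`, the channel-aligning layers, and the sigmoid gating are all hypothetical; only the overall pattern (sample dense image features at projected point locations for the camera-to-LiDAR direction, scatter sparse point features onto the image plane for the LiDAR-to-camera direction) follows the description in the abstract.

```python
# Minimal sketch of bidirectional camera-LiDAR feature fusion.
# Names, shapes, and the gating scheme are illustrative assumptions,
# not the exact Bi-CLFM design from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiCLFMSketch(nn.Module):
    """Fuse a dense image feature map with sparse point features in both
    directions, given the 2D projection of each 3D point."""

    def __init__(self, img_channels: int, pts_channels: int):
        super().__init__()
        # Align channel counts before mixing the two modalities.
        self.pts_to_img = nn.Conv2d(pts_channels, img_channels, 1)
        self.img_to_pts = nn.Linear(img_channels, pts_channels)
        # Learned gates decide how much of the other modality to inject.
        self.img_gate = nn.Conv2d(2 * img_channels, img_channels, 1)
        self.pts_gate = nn.Linear(2 * pts_channels, pts_channels)

    def forward(self, img_feat, pts_feat, uv):
        # img_feat: (B, Ci, H, W) dense image features
        # pts_feat: (B, N, Cp)    sparse point features
        # uv:       (B, N, 2)     float pixel coords of each projected point
        B, Ci, H, W = img_feat.shape
        _, N, Cp = pts_feat.shape

        # Camera -> LiDAR: bilinearly sample image features at the projected
        # point locations (grid_sample expects coords normalized to [-1, 1]).
        grid = uv.clone()
        grid[..., 0] = 2.0 * uv[..., 0] / (W - 1) - 1.0
        grid[..., 1] = 2.0 * uv[..., 1] / (H - 1) - 1.0
        sampled = F.grid_sample(
            img_feat, grid.unsqueeze(2), align_corners=True
        ).squeeze(-1).transpose(1, 2)              # (B, N, Ci)
        inject_p = self.img_to_pts(sampled)        # (B, N, Cp)
        gate_p = torch.sigmoid(
            self.pts_gate(torch.cat([pts_feat, inject_p], dim=-1)))
        pts_out = pts_feat + gate_p * inject_p

        # LiDAR -> Camera: scatter point features onto the image plane at
        # their nearest projected pixels (unhit pixels stay zero).
        canvas = img_feat.new_zeros(B, Cp, H * W)
        idx = (uv[..., 1].round().clamp(0, H - 1) * W
               + uv[..., 0].round().clamp(0, W - 1)).long()  # (B, N)
        canvas.scatter_(2, idx.unsqueeze(1).expand(-1, Cp, -1),
                        pts_feat.transpose(1, 2))
        inject_i = self.pts_to_img(canvas.view(B, Cp, H, W))  # (B, Ci, H, W)
        gate_i = torch.sigmoid(
            self.img_gate(torch.cat([img_feat, inject_i], dim=1)))
        img_out = img_feat + gate_i * inject_i

        return img_out, pts_out
```

In this sketch each branch keeps its native representation (a dense grid for the camera, an unordered point set for the LiDAR) and only exchanges gated residual features, which is one plausible way to avoid the early-/late-fusion dilemma the abstract describes.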