While recent camera-only 3D detection methods leverage multiple timesteps, the limited history they use significantly hampers the extent to which temporal fusion can improve object perception. Observing that existing works' fusion of multi-frame images is an instance of temporal stereo matching, we find that performance is hindered by the interplay between 1) the low granularity of matching resolution and 2) the sub-optimal multi-view setup produced by limited history usage. Our theoretical and empirical analysis demonstrates that the optimal temporal difference between views varies significantly for different pixels and depths, making it necessary to fuse many timesteps over long-term history. Building on our investigation, we propose to generate a cost volume from a long history of image observations, compensating for the coarse but efficient matching resolution with a more optimal multi-view matching setup. Further, we augment the per-frame monocular depth predictions used for long-term, coarse matching with short-term, fine-grained matching and find that long- and short-term temporal fusion are highly complementary. While maintaining high efficiency, our framework sets a new state-of-the-art on nuScenes, achieving first place on the test set and outperforming the previous best method by 5.2% mAP and 3.7% NDS on the validation set. Code will be released $\href{https://github.com/Divadi/SOLOFusion}{here}$.
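To make the temporal stereo framing concrete, below is a minimal PyTorch sketch of plane-sweep matching against a history of past frames: features from each historical frame are warped onto the reference view at a set of hypothesized depth planes, and per-plane correlation scores are averaged over time into a cost volume. This is an illustrative sketch of the general technique the abstract refers to, not the paper's implementation; the helper names (`warp_to_reference`, `temporal_cost_volume`), tensor shapes, and hyperparameters are all assumptions.

```python
# Illustrative plane-sweep temporal cost volume (assumed shapes/names,
# not the SOLOFusion implementation).
import torch
import torch.nn.functional as F

def make_depth_planes(d_min=2.0, d_max=58.0, n_planes=64):
    """Candidate depth hypotheses for plane-sweep matching (assumed range)."""
    return torch.linspace(d_min, d_max, n_planes)

def warp_to_reference(src_feat, depth, K, T_src_ref):
    """Warp source-frame features onto the reference view at one depth plane.
    src_feat: (B, C, H, W) features from a past frame
    depth:    scalar depth hypothesis
    K:        (3, 3) camera intrinsics
    T_src_ref:(4, 4) pose mapping reference-camera points into the source camera
    """
    B, C, H, W = src_feat.shape
    device = src_feat.device
    # Homogeneous pixel grid in the reference view.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    # Back-project reference pixels to 3D at the hypothesized depth.
    cam_pts = torch.linalg.inv(K) @ pix * depth               # (3, H*W)
    cam_pts = torch.cat([cam_pts, torch.ones(1, H * W, device=device)], dim=0)
    # Transform into the source camera and project to source pixels.
    src_pts = (T_src_ref @ cam_pts)[:3]
    src_pix = K @ src_pts
    src_pix = src_pix[:2] / src_pix[2:].clamp(min=1e-6)
    # Normalize to [-1, 1] and sample source features at the warped locations.
    gx = src_pix[0].reshape(H, W) / (W - 1) * 2 - 1
    gy = src_pix[1].reshape(H, W) / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).expand(B, H, W, 2)
    return F.grid_sample(src_feat, grid, align_corners=True)

def temporal_cost_volume(ref_feat, hist_feats, K, poses, depths):
    """Correlate the reference frame with each historical frame at every
    depth plane, averaging matching scores over the whole history.
    Returns a (B, n_planes, H, W) cost volume."""
    vol = []
    for d in depths:
        score = 0.0
        for src_feat, T in zip(hist_feats, poses):
            warped = warp_to_reference(src_feat, d, K, T)
            # Dot-product correlation across channels as the matching score.
            score = score + (ref_feat * warped).mean(dim=1, keepdim=True)
        vol.append(score / len(hist_feats))
    return torch.cat(vol, dim=1)
```

Under this framing, the paper's trade-off is visible in the sketch's parameters: a long history (many entries in `hist_feats`) improves the multi-view matching setup even when `n_planes` and the feature resolution are kept coarse for efficiency, while short-term fine-grained matching would use few frames but denser planes and higher-resolution features.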