3D object detection from multiple image views is a fundamental and challenging task for visual scene understanding. Because it is low-cost and efficient, multi-view 3D object detection has promising application prospects. However, accurately detecting objects in 3D space from perspective views is extremely difficult because depth information is missing. Recently, DETR3D introduced a novel 3D-2D query paradigm for aggregating multi-view images in 3D object detection and achieved state-of-the-art performance. In this paper, through intensive pilot experiments, we quantify the objects located in different image regions and find that "truncated instances" (i.e., those at the border regions of each image) are the main bottleneck limiting the performance of DETR3D. Although it merges features from two adjacent views in the overlapping regions, DETR3D still suffers from insufficient feature aggregation and therefore does not reach its full detection performance. To tackle this problem, we propose Graph-DETR3D, which automatically aggregates multi-view image information through graph structure learning (GSL). It constructs a dynamic 3D graph between each object query and the 2D feature maps to enhance the object representations, especially at the border regions. In addition, Graph-DETR3D benefits from a novel depth-invariant multi-scale training strategy, which maintains visual depth consistency by simultaneously scaling the image size and the object depth. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of Graph-DETR3D. Notably, our best model achieves 49.5 NDS on the nuScenes test leaderboard, setting a new state of the art among published image-view 3D object detectors.
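The depth-invariant multi-scale idea can be illustrated with a small sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes a pinhole camera whose focal length is kept fixed while the image is resized by a factor `r`, with the principal point moving with the resize. Under those assumptions, the resized image looks as if each object sat at depth `Z / r`. The function names and numeric values are hypothetical and chosen only for illustration.

```python
def project(point, focal, principal):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixels (u, v)."""
    x, y, z = point
    cx, cy = principal
    return (focal * x / z + cx, focal * y / z + cy)


def resize_pixel(pixel, r):
    """Pixel coordinates after resizing the whole image by a factor r."""
    u, v = pixel
    return (r * u, r * v)


def rescale_depth(point, r):
    """Assumed label adjustment: keep X and Y, divide the depth Z by r."""
    x, y, z = point
    return (x, y, z / r)


# Hypothetical camera and object, for illustration only.
focal = 1000.0
principal = (800.0, 450.0)
point = (2.0, 1.0, 20.0)
r = 0.5  # shrink the image to half size

# Where the object lands after projecting and then resizing the image.
resized = resize_pixel(project(point, focal, principal), r)

# Where the depth-adjusted object lands when the focal length is unchanged
# and only the principal point follows the resize.
scaled_principal = (r * principal[0], r * principal[1])
adjusted = project(rescale_depth(point, r), focal, scaled_principal)
```

In this sketch, `resized` and `adjusted` coincide, which is the sense in which scaling the image and the depth together keeps apparent size and depth consistent. Scaling the image alone would instead change an object's apparent size without changing its depth label.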