To achieve accurate 3D object detection at low cost for autonomous driving, many multi-camera methods have been proposed to address the occlusion problem of monocular approaches. However, due to the lack of accurately estimated depth, existing multi-camera methods often generate multiple bounding boxes along the depth direction for hard-to-detect small objects such as pedestrians, resulting in extremely low recall. Furthermore, directly applying depth prediction modules to existing multi-camera methods, which are generally built on large network architectures, cannot meet the real-time requirements of self-driving applications. To address these issues, we propose Cross-view and Depth-guided Transformers for 3D Object Detection, CrossDTR. First, our lightweight depth predictor is designed to produce precise object-wise sparse depth maps and low-dimensional depth embeddings without requiring extra depth datasets for supervision. Second, a cross-view depth-guided transformer is developed to fuse the depth embeddings with image features from cameras of different views and to generate 3D bounding boxes. Extensive experiments demonstrate that our method surpasses existing multi-camera methods by 10 percent in pedestrian detection and by about 3 percent in overall mAP and NDS metrics. Moreover, computational analyses show that our method is 5 times faster than prior approaches. Our code will be made publicly available at https://github.com/sty61010/CrossDTR.
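To illustrate the two components named in the abstract (a lightweight depth predictor and a cross-view depth-guided transformer), here is a minimal PyTorch-style sketch of how depth embeddings could guide cross-attention over multi-view image features. All module names, tensor shapes, dimensions, and the additive fusion of depth embeddings into the keys/values are illustrative assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch of depth-guided cross-view fusion; shapes and fusion choice are assumptions.
import torch
import torch.nn as nn


class LightweightDepthPredictor(nn.Module):
    """Predicts a sparse depth logit map and a low-dimensional depth embedding
    from per-view image features (assumed shape: [B*N_views, C, H, W])."""

    def __init__(self, in_channels=256, embed_dim=64, num_depth_bins=64):
        super().__init__()
        self.depth_head = nn.Conv2d(in_channels, num_depth_bins, kernel_size=1)
        self.embed_head = nn.Conv2d(num_depth_bins, embed_dim, kernel_size=1)

    def forward(self, feats):
        depth_logits = self.depth_head(feats)        # object-wise sparse depth map (logits)
        depth_embed = self.embed_head(depth_logits)  # low-dimensional depth embedding
        return depth_logits, depth_embed


class DepthGuidedDecoderLayer(nn.Module):
    """Object queries cross-attend to image features augmented with depth
    embeddings, flattened and concatenated over all camera views."""

    def __init__(self, d_model=256, embed_dim=64, num_heads=8):
        super().__init__()
        self.depth_proj = nn.Linear(embed_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, queries, img_tokens, depth_tokens):
        # queries:      [B, num_queries, d_model]
        # img_tokens:   [B, N_views*H*W, d_model]   flattened multi-view image features
        # depth_tokens: [B, N_views*H*W, embed_dim] flattened depth embeddings
        kv = img_tokens + self.depth_proj(depth_tokens)  # depth-guided keys/values (assumed additive fusion)
        attn_out, _ = self.cross_attn(queries, kv, kv)
        return self.norm(queries + attn_out)


if __name__ == "__main__":
    B, V, C, H, W, Q = 1, 6, 256, 16, 44, 100
    feats = torch.randn(B * V, C, H, W)
    predictor = LightweightDepthPredictor()
    _, depth_embed = predictor(feats)

    img_tokens = feats.flatten(2).transpose(1, 2).reshape(B, V * H * W, C)
    depth_tokens = depth_embed.flatten(2).transpose(1, 2).reshape(B, V * H * W, 64)
    layer = DepthGuidedDecoderLayer()
    out = layer(torch.randn(B, Q, C), img_tokens, depth_tokens)
    print(out.shape)  # torch.Size([1, 100, 256])
```

In this sketch, object queries attend to a joint token sequence from all camera views, so a single query can aggregate evidence across views; the projected depth embeddings bias the keys and values toward the predicted depth, which is one plausible way to suppress duplicate boxes along the depth direction.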