Recognizing and localizing objects in 3D space is a crucial ability for an AI agent to perceive its surrounding environment. While significant progress has been achieved with expensive LiDAR point clouds, 3D object detection from a single monocular image remains highly challenging. Several alternatives exist for tackling this problem, but they either rely on heavy networks to fuse RGB and depth information or are empirically ineffective at processing millions of pseudo-LiDAR points. Through in-depth examination, we find that these limitations are rooted in inaccurate object localization. In this paper, we propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations. Specifically, a localization boosting mechanism with a confidence-aware loss is introduced to progressively refine the localization prediction. In addition, semantic image representation is exploited to compensate for the usage of patch proposals. Despite being lightweight and simple, our strategy leads to substantial improvements on the KITTI and Waymo Open Dataset monocular 3D detection benchmarks. Moreover, PCT generalizes well to most coordinate-based 3D detection frameworks. The code is available at: https://github.com/amazon-research/progressive-coordinate-transforms.