Recognizing and localizing objects in 3D space is a crucial ability for an AI agent to perceive its surrounding environment. While significant progress has been achieved with expensive LiDAR point clouds, 3D object detection from a single monocular image remains a great challenge. Although several alternatives exist for tackling this problem, we find that they either rely on heavy networks to fuse RGB and depth information or are empirically ineffective at processing millions of pseudo-LiDAR points. Through in-depth examination, we find that these limitations are rooted in inaccurate object localization. In this paper, we propose a novel and lightweight approach, dubbed {\em Progressive Coordinate Transforms} (PCT), to facilitate learning coordinate representations. Specifically, a localization boosting mechanism with a confidence-aware loss is introduced to progressively refine the localization prediction. In addition, semantic image representations are exploited to compensate for the use of patch proposals. Despite being lightweight and simple, our strategy leads to substantial improvements on the KITTI and Waymo Open Dataset monocular 3D detection benchmarks. Moreover, PCT generalizes well to most coordinate-based 3D detection frameworks. The code is available at https://github.com/amazon-research/progressive-coordinate-transforms.
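To make the notion of a coordinate representation concrete, the following is a minimal sketch of how a depth patch can be lifted to camera-frame 3D coordinates under a standard pinhole camera model. It is an illustration under stated assumptions, not the authors' implementation: the function name `backproject_patch`, its parameters (`fx`, `fy`, `cx`, `cy`, `u0`, `v0`), and the toy values are all hypothetical, and the depth values are assumed to come from some monocular depth estimator.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_patch(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    u0: int = 0,
    v0: int = 0,
) -> List[List[Point3D]]:
    """Lift a depth patch to camera-frame (x, y, z) coordinates.

    Hypothetical helper: depth[r][c] is the estimated depth in metres at
    image pixel (u0 + c, v0 + r); fx, fy, cx, cy are pinhole intrinsics.
    """
    coords = []
    for r, row in enumerate(depth):
        out_row = []
        for c, z in enumerate(row):
            u = u0 + c  # pixel column in the full image
            v = v0 + r  # pixel row in the full image
            x = (u - cx) * z / fx  # pinhole back-projection, horizontal axis
            y = (v - cy) * z / fy  # pinhole back-projection, vertical axis
            out_row.append((x, y, z))
        coords.append(out_row)
    return coords


# Toy 2x2 patch whose top-left pixel sits at (u0, v0) = (100, 50).
# Intrinsics below are made-up values chosen for easy arithmetic.
patch = [[10.0, 10.0], [20.0, 20.0]]
coords = backproject_patch(
    patch, fx=500.0, fy=500.0, cx=100.0, cy=50.0, u0=100, v0=50
)
```

Because every coordinate scales with the estimated depth, an error in depth shifts the whole patch in 3D, which is one way to see why inaccurate localization can dominate the final detection quality.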