In this paper, we propose Cross Modal Transformer (CMT), a robust detector for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is simple, yet its performance is strong: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains robust even when the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
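The implicit alignment described in the abstract can be illustrated with a minimal sketch: each token, whether from the image or the point cloud, receives a position encoding computed from a 3D point in a shared coordinate frame, and the two token sets are then concatenated for a transformer. The abstract does not specify the encoding function or how 3D points are assigned to tokens, so everything below is an assumption for illustration: the fixed sinusoidal encoding (standing in for whatever encoder the model actually uses), the function names, and the token width of 24.

```python
import math


def sinusoidal_encoding(point, num_freqs=4):
    """Encode one 3D point (x, y, z) as a flat list of sin/cos features.

    Hypothetical stand-in for the model's position encoder.
    """
    features = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features  # length is 3 * 2 * num_freqs


def add_position(tokens, points, num_freqs=4):
    """Add each token's 3D point encoding to its feature vector."""
    out = []
    for token, point in zip(tokens, points):
        enc = sinusoidal_encoding(point, num_freqs)
        if len(enc) != len(token):
            raise ValueError("token width must equal 6 * num_freqs")
        out.append([t + e for t, e in zip(token, enc)])
    return out


def build_multimodal_tokens(image_tokens, image_points,
                            lidar_tokens, lidar_points):
    """Position-encode both modalities in one 3D frame and concatenate."""
    return (add_position(image_tokens, image_points)
            + add_position(lidar_tokens, lidar_points))


# Toy input: two image tokens and one point cloud token, all zero features.
width = 24  # 3 coordinates * 2 (sin, cos) * 4 frequencies
image_tokens = [[0.0] * width, [0.0] * width]
image_points = [(1.0, 2.0, 0.5), (0.0, 0.0, 0.0)]
lidar_tokens = [[0.0] * width]
lidar_points = [(1.0, 2.0, 0.5)]

tokens = build_multimodal_tokens(image_tokens, image_points,
                                 lidar_tokens, lidar_points)
```

In this toy setup, the first image token and the point cloud token share the same 3D point and therefore receive the same positional offset, which is the sense in which tokens from different modalities become comparable without an explicit view transformation.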