Multi-view radar-camera fused 3D object detection provides a longer detection range and more informative features for autonomous driving, especially under adverse weather. Current radar-camera fusion methods offer various designs for combining radar information with camera data. However, these approaches usually fuse multi-modal features by straightforward concatenation, which ignores the semantic alignment of radar features and neglects sufficient cross-modal correlations. In this paper, we present MVFusion, a novel Multi-View radar-camera Fusion method that produces semantically aligned radar features and enhances cross-modal information interaction. To this end, we inject semantic alignment into the radar features via a semantic-aligned radar encoder (SARE) to produce image-guided radar features. We then propose a radar-guided fusion transformer (RGFT) that fuses the radar and image features, strengthening the correlation between the two modalities at a global scope via the cross-attention mechanism. Extensive experiments show that MVFusion achieves state-of-the-art performance (51.7% NDS and 45.3% mAP) on the nuScenes dataset. We shall release our code and trained networks upon publication.
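To make the cross-attention fusion step concrete, the following is a minimal sketch, not the authors' released code: it assumes flattened image tokens act as queries and semantic-aligned radar tokens act as keys and values, with residual fusion and layer normalization. All class names, tensor shapes, and dimensions here are illustrative assumptions.

```python
# Hypothetical sketch of radar-guided cross-attention fusion (not the
# official MVFusion/RGFT implementation). Assumes image features query
# radar features; shapes and module names are illustrative only.
import torch
import torch.nn as nn

class RadarGuidedCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat: torch.Tensor, radar_feat: torch.Tensor) -> torch.Tensor:
        # img_feat:   (B, N_img, C) flattened multi-view image tokens
        # radar_feat: (B, N_rad, C) semantic-aligned radar tokens
        fused, _ = self.attn(query=img_feat, key=radar_feat, value=radar_feat)
        # Residual connection keeps the original image features and adds
        # radar-conditioned context aggregated from the global scope.
        return self.norm(img_feat + fused)

if __name__ == "__main__":
    img = torch.randn(2, 900, 256)   # e.g., 900 image tokens
    rad = torch.randn(2, 300, 256)   # e.g., 300 radar tokens
    out = RadarGuidedCrossAttention()(img, rad)
    print(out.shape)  # torch.Size([2, 900, 256])
```

Because attention lets every image token attend to every radar token, the fusion is global rather than limited to spatially adjacent features, which is the stated motivation for replacing plain concatenation.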