In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features at selected multiple scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.
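The core fusion step can be illustrated with a minimal, simplified sketch of the cross-attention idea behind the MCA module: vehicle-side feature tokens attend to infrastructure-side tokens independently at each scale, and the attended result is added back residually. All names, shapes, and the single-head, parameter-free formulation here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(veh, inf):
    """Fuse infrastructure tokens into vehicle tokens (single-head sketch).

    veh: (N, d) vehicle-side feature tokens, used as queries.
    inf: (M, d) infrastructure-side tokens, used as keys and values.
    Returns (N, d) fused tokens with a residual connection.
    """
    d = veh.shape[-1]
    attn = softmax(veh @ inf.T / np.sqrt(d))  # (N, M) attention weights
    return veh + attn @ inf                   # residual fusion

def multi_scale_fuse(veh_feats, inf_feats):
    # Apply cross-attention independently at each feature scale.
    return [cross_attention(v, i) for v, i in zip(veh_feats, inf_feats)]

rng = np.random.default_rng(0)
# Two toy scales: coarse (4 tokens) and fine (16 tokens), d = 8.
veh = [rng.standard_normal((4, 8)), rng.standard_normal((16, 8))]
inf = [rng.standard_normal((4, 8)), rng.standard_normal((16, 8))]
fused = multi_scale_fuse(veh, inf)
```

Because the attention weights are recomputed per query, a spatial offset between vehicle and infrastructure features (e.g. from camera asynchrony) shifts where attention lands rather than corrupting a fixed projective alignment, which is the intuition for why attention-based fusion tolerates calibration noise.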