Vehicle-to-vehicle (V2V) technologies have enabled autonomous vehicles to share information and see through occlusions, greatly enhancing perception performance. Nevertheless, existing works have all focused on homogeneous traffic in which vehicles are equipped with the same type of sensors, which significantly limits the scale of collaboration and the benefit of cross-modality interactions. In this paper, we investigate the multi-agent hetero-modal cooperative perception problem, where agents may have distinct sensor modalities. We present HM-ViT, the first unified multi-agent hetero-modal cooperative perception framework that can collaboratively predict 3D objects in highly dynamic V2V collaborations with varying numbers and types of agents. To effectively fuse features from multi-view images and LiDAR point clouds, we design a novel heterogeneous 3D graph transformer to jointly reason about inter-agent and intra-agent interactions. Extensive experiments on the V2V perception dataset OPV2V demonstrate that HM-ViT outperforms state-of-the-art cooperative perception methods for V2V hetero-modal cooperative perception. We will release the code to facilitate future research.
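To give a concrete intuition for the heterogeneity-aware fusion described above, the following is a minimal sketch of modality-conditioned attention over per-agent features, where projection weights depend on each agent's sensor type. The class name `HeteroAgentAttention`, the feature shapes, and the camera/LiDAR type ids are hypothetical illustrations, not the actual HM-ViT implementation, which additionally performs local and global 3D spatial attention over bird's-eye-view feature maps.

```python
# Illustrative sketch only: modality-aware (heterogeneous) attention across agents.
# Assumes each agent contributes a pooled feature vector and a modality id
# (e.g., 0 = camera, 1 = LiDAR); names and shapes are hypothetical.
import torch
import torch.nn as nn


class HeteroAgentAttention(nn.Module):
    """Attention whose Q/K/V projections depend on each agent's sensor modality."""

    def __init__(self, dim: int, n_types: int = 2):
        super().__init__()
        # One projection per modality type, so camera and LiDAR features
        # are embedded with separate learned weights before interacting.
        self.q = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_types))
        self.k = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_types))
        self.v = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_types))
        self.scale = dim ** -0.5

    def forward(self, feats: torch.Tensor, types: torch.Tensor) -> torch.Tensor:
        # feats: (num_agents, dim) per-agent features; types: (num_agents,) modality ids.
        q = torch.stack([self.q[t](f) for f, t in zip(feats, types.tolist())])
        k = torch.stack([self.k[t](f) for f, t in zip(feats, types.tolist())])
        v = torch.stack([self.v[t](f) for f, t in zip(feats, types.tolist())])
        attn = torch.softmax(q @ k.T * self.scale, dim=-1)  # inter-agent attention weights
        return attn @ v  # each agent aggregates information from all collaborators


# Usage: three agents, two camera-based and one LiDAR-based, with 64-dim features.
layer = HeteroAgentAttention(dim=64)
fused = layer(torch.randn(3, 64), torch.tensor([0, 0, 1]))
print(fused.shape)  # torch.Size([3, 64])
```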