How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting. However, for the actual driving task, the global context of the 3D scene is key, e.g. a change in traffic light state can affect the behavior of a vehicle geometrically distant from that traffic light. Geometry alone may therefore be insufficient for effectively fusing representations in end-to-end driving models. In this work, we demonstrate that imitation learning policies based on existing sensor fusion methods under-perform in the presence of a high density of dynamic agents and complex scenarios, which require global contextual reasoning, such as handling traffic oncoming from multiple directions at uncontrolled intersections. Therefore, we propose TransFuser, a novel Multi-Modal Fusion Transformer, to integrate image and LiDAR representations using attention. We experimentally validate the efficacy of our approach in urban settings involving complex scenarios using the CARLA urban driving simulator. Our approach achieves state-of-the-art driving performance while reducing collisions by 76% compared to geometry-based fusion.
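To make the attention-based fusion idea concrete, the minimal PyTorch sketch below shows one way image and LiDAR feature maps could be fused by a shared transformer operating over the concatenated token set from both modalities. The class name `AttentionFusion`, the layer sizes, and the assumption that both feature maps share the same resolution are illustrative choices for this sketch, not the authors' exact TransFuser configuration.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-based multi-modal fusion: tokens from the image and
    LiDAR branches attend to each other through a shared transformer encoder,
    giving each branch access to global scene context (hypothetical sizes)."""

    def __init__(self, dim=256, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, img_feat, lidar_feat):
        # img_feat, lidar_feat: (B, C, H, W) intermediate feature maps from
        # separate image and LiDAR convolutional branches (assumed same shape here).
        B, C, H, W = img_feat.shape
        img_tokens = img_feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        lidar_tokens = lidar_feat.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)  # joint token set
        fused = self.encoder(tokens)                           # attention across both modalities
        # Split back and reshape so each branch receives globally fused context.
        img_out, lidar_out = fused.split(H * W, dim=1)
        img_out = img_out.transpose(1, 2).reshape(B, C, H, W)
        lidar_out = lidar_out.transpose(1, 2).reshape(B, C, H, W)
        return img_out, lidar_out


if __name__ == "__main__":
    fusion = AttentionFusion()
    img = torch.randn(2, 256, 8, 8)
    lidar = torch.randn(2, 256, 8, 8)
    f_img, f_lidar = fusion(img, lidar)
    print(f_img.shape, f_lidar.shape)  # both torch.Size([2, 256, 8, 8])
```

Because every token can attend to every other token regardless of spatial location, a cue such as a distant traffic light in the image branch can influence features anywhere in the LiDAR branch, which is the kind of global contextual reasoning that purely geometry-based projection cannot provide.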