How should we integrate representations from complementary sensors for autonomous driving? Geometry-based fusion has shown promise for perception (e.g., object detection, motion forecasting). However, in the context of end-to-end driving, we find that imitation learning policies based on existing sensor fusion methods underperform in complex driving scenarios with a high density of dynamic agents. Therefore, we propose TransFuser, a mechanism to integrate image and LiDAR representations using self-attention. Our approach uses transformer modules at multiple resolutions to fuse perspective-view and bird's-eye-view feature maps. We experimentally validate its efficacy on a challenging new benchmark with long routes and dense traffic, as well as the official leaderboard of the CARLA urban driving simulator. At the time of submission, TransFuser outperforms all prior work on the CARLA leaderboard in terms of driving score by a large margin. Compared to geometry-based fusion, TransFuser reduces the average collisions per kilometer by 48%.
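To make the fusion mechanism concrete, the following is a minimal sketch of self-attention fusion between a perspective-view image feature map and a bird's-eye-view LiDAR feature map. It is not the released TransFuser implementation: the module name `AttentionFusion`, the channel count, the single fusion resolution, and the use of `nn.TransformerEncoder` are illustrative assumptions; in the described approach such modules would be applied at multiple resolutions between the two backbone branches.

```python
# Sketch: joint self-attention over flattened image and LiDAR BEV features.
# Assumes both branches expose intermediate features of shape (B, C, H, W).
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4, num_layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads,
            dim_feedforward=4 * channels, batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, img_feat: torch.Tensor, lidar_feat: torch.Tensor):
        # img_feat:   (B, C, Hi, Wi) perspective-view features
        # lidar_feat: (B, C, Hl, Wl) bird's-eye-view features
        b, c, hi, wi = img_feat.shape
        _, _, hl, wl = lidar_feat.shape
        # Flatten each spatial grid into a token sequence and concatenate,
        # so attention can mix information across both sensor views.
        tokens = torch.cat([
            img_feat.flatten(2).transpose(1, 2),    # (B, Hi*Wi, C)
            lidar_feat.flatten(2).transpose(1, 2),  # (B, Hl*Wl, C)
        ], dim=1)
        fused = self.transformer(tokens)
        # Split the fused sequence back and restore each spatial layout.
        img_out = fused[:, : hi * wi].transpose(1, 2).reshape(b, c, hi, wi)
        lidar_out = fused[:, hi * wi :].transpose(1, 2).reshape(b, c, hl, wl)
        return img_out, lidar_out


if __name__ == "__main__":
    fusion = AttentionFusion(channels=64)
    img = torch.randn(2, 64, 8, 8)    # e.g. downsampled camera features
    lidar = torch.randn(2, 64, 8, 8)  # e.g. downsampled LiDAR BEV features
    img_f, lidar_f = fusion(img, lidar)
    print(img_f.shape, lidar_f.shape)  # both torch.Size([2, 64, 8, 8])
```

The key design choice this sketch illustrates is that, unlike geometry-based fusion, no explicit projection between the perspective and bird's-eye views is required: attention weights learned over the concatenated token sequence determine how features from the two views exchange information.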