Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors, such as camera and lidar or camera and radar, by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
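The abstract's core fusion mechanism is cross-attention, where camera features attend to features from an additional modality. The following is a minimal, hedged sketch of that general idea in numpy: camera tokens act as queries and projected lidar tokens as keys/values, with a residual connection adding the attended auxiliary features back onto the camera features. This is an illustrative single-window, single-head simplification, not the paper's actual multi-window cross-attention block, and all names (`cross_attention_fuse`, token counts, dimensions) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(cam, aux):
    """Fuse auxiliary-sensor features into camera features via scaled
    dot-product cross-attention: camera tokens are queries, auxiliary
    tokens (e.g. image-projected lidar or radar points) are keys/values.
    Shapes: cam (N_cam, d), aux (N_aux, d) -> returns (N_cam, d)."""
    d = cam.shape[-1]
    scores = cam @ aux.T / np.sqrt(d)   # (N_cam, N_aux) similarity
    attn = softmax(scores, axis=-1)     # each camera token's weights over aux tokens
    return cam + attn @ aux             # residual fusion keeps camera features primary

# toy example: 4 camera tokens, 3 lidar tokens, 8-dim features
rng = np.random.default_rng(0)
cam = rng.standard_normal((4, 8))
lidar = rng.standard_normal((3, 8))
fused = cross_attention_fuse(cam, lidar)
print(fused.shape)
```

In the actual architecture this fusion would be applied at multiple resolutions of a high-resolution backbone and restricted to local windows; the sketch only shows the attention-based mixing step that makes the design agnostic to the number and type of extra modalities.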