We propose DeepFusion, a modular multi-modal architecture that fuses lidars, cameras, and radars in different combinations for 3D object detection. Specialized feature extractors take advantage of each modality and can be exchanged easily, making the approach simple and flexible. Extracted features are transformed into bird's-eye view as a common representation for fusion. Spatial and semantic alignment is performed prior to fusing modalities in the feature space. Finally, a detection head exploits rich multi-modal features for improved 3D detection performance. Experimental results for lidar-camera, lidar-camera-radar, and camera-radar fusion show the flexibility and effectiveness of our fusion approach. In the process, we study the largely unexplored task of faraway car detection up to 225~meters, showing the benefits of our lidar-camera fusion. Furthermore, we investigate the density of lidar points required for 3D object detection and illustrate the implications using robustness against adverse weather conditions as an example. Moreover, ablation studies on our camera-radar fusion highlight the importance of accurate depth estimation.
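To make the described pipeline concrete, the following is a minimal PyTorch sketch of the modular structure: exchangeable per-modality extractors producing bird's-eye-view features, an alignment step, feature-space fusion, and a detection head. All module names, shapes, and the stand-in layers (a pooled conv as the view transform, 1x1 convs for alignment) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only; not the authors' DeepFusion implementation.
import torch
import torch.nn as nn


class DummyBEVEncoder(nn.Module):
    """Stand-in modality-specific extractor: maps an image-like input
    tensor onto a fixed-size bird's-eye-view feature grid."""
    def __init__(self, in_ch: int, bev_ch: int, bev_size: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, bev_ch, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(bev_size),  # crude stand-in for the BEV view transform
        )

    def forward(self, x):
        return self.backbone(x)


class FusionSketch(nn.Module):
    """Fuses an arbitrary subset of modalities in BEV, then applies a detection head."""
    def __init__(self, in_channels: dict, bev_ch: int = 64, num_outputs: int = 7):
        super().__init__()
        # One exchangeable extractor per modality.
        self.encoders = nn.ModuleDict(
            {m: DummyBEVEncoder(c, bev_ch) for m, c in in_channels.items()}
        )
        # Spatial/semantic alignment stand-in: per-modality 1x1 conv in BEV.
        self.align = nn.ModuleDict(
            {m: nn.Conv2d(bev_ch, bev_ch, 1) for m in in_channels}
        )
        # Fusion of concatenated BEV features, followed by a dense detection head.
        self.fuse = nn.Conv2d(bev_ch * len(in_channels), bev_ch, 3, padding=1)
        self.head = nn.Conv2d(bev_ch, num_outputs, 1)  # e.g. per-cell box parameters

    def forward(self, inputs: dict):
        bev = [self.align[m](self.encoders[m](x)) for m, x in inputs.items()]
        return self.head(torch.relu(self.fuse(torch.cat(bev, dim=1))))


# Any modality combination works by changing the dict, e.g. camera-radar fusion:
model = FusionSketch({"camera": 3, "radar": 1})
out = model({"camera": torch.rand(1, 3, 256, 256),
             "radar": torch.rand(1, 1, 128, 128)})
print(out.shape)  # torch.Size([1, 7, 128, 128])
```

Because the fusion layer is sized from the modality dictionary at construction time, swapping extractors or dropping a sensor only changes the constructor arguments, which mirrors the flexibility claimed above.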