Recently, transformer networks have outperformed traditional deep neural networks in natural language processing and show great potential in many computer vision tasks compared with convolutional backbones. In the original transformer, readout tokens serve as designated vectors for aggregating information from the other tokens. However, readout tokens provide only limited performance in a vision transformer. Therefore, we propose a novel fusion strategy that integrates radar data into a dense prediction transformer network by reassembling camera representations with radar representations. Instead of relying on readout tokens, the radar representations contribute additional depth information to a monocular depth estimation model and improve its performance. We further investigate different fusion approaches that are commonly used for integrating an additional modality into a dense prediction transformer network. The experiments are conducted on the nuScenes dataset, which includes camera images, lidar, and radar data. The results show that our proposed method yields better performance than the commonly used fusion strategies and outperforms existing convolutional depth estimation models that fuse camera images and radar.
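To make the fusion strategy concrete, the following is a minimal sketch of a reassemble stage that merges radar patch embeddings with camera patch embeddings in place of a readout token. The class name, tensor shapes, and the linear-projection fusion are assumptions for illustration, loosely mirroring the "project" readout variant of a dense prediction transformer; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn


class RadarReassemble(nn.Module):
    """Hypothetical sketch: fuse radar representations into a DPT-style
    reassemble stage instead of using a readout token.

    Assumed inputs: camera and radar patch embeddings of shape
    (batch, num_patches, dim), where num_patches = h * w.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Project the concatenated camera+radar channels back to the
        # original token width (assumed fusion choice, not from the paper).
        self.project = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, cam_tokens: torch.Tensor, radar_tokens: torch.Tensor,
                h: int, w: int) -> torch.Tensor:
        # Concatenate the two modalities channel-wise and project,
        # so radar depth cues are mixed into every camera token.
        fused = self.project(torch.cat([cam_tokens, radar_tokens], dim=-1))
        # Reassemble the token sequence into an image-like feature map
        # of shape (batch, dim, h, w) for the convolutional decoder.
        return fused.transpose(1, 2).reshape(-1, fused.shape[-1], h, w)
```

Under these assumptions, each transformer stage would pass its camera tokens and the corresponding radar tokens through such a module before decoding, which is how the radar branch can replace the readout token as the carrier of auxiliary depth information.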