Multispectral image pairs can provide combined information, making object detection applications more reliable and robust in the open world. To fully exploit the different modalities, in this paper we present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT). Unlike prior CNN-based works, our network, guided by the transformer scheme, learns long-range dependencies and integrates global contextual information during feature extraction. More importantly, by leveraging the self-attention of the transformer, the network can naturally carry out simultaneous intra-modality and inter-modality fusion and robustly capture the latent interactions between the RGB and thermal domains, thereby significantly improving multispectral object detection performance. Extensive experiments and ablation studies on multiple datasets demonstrate that our approach is effective and achieves state-of-the-art detection performance. Our code and models are available at https://github.com/DocF/multispectral-object-detection.
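The key idea, simultaneous intra- and inter-modality fusion via self-attention, can be illustrated with a minimal sketch: if RGB and thermal feature tokens are concatenated into one sequence, a single self-attention pass produces an attention matrix with four blocks (RGB→RGB, T→T, RGB→T, T→RGB), so both within-modality and cross-modality interactions are modeled at once. This is a simplified single-head numpy illustration of the general mechanism, not the authors' implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_self_attention(rgb_tokens, thermal_tokens, wq, wk, wv):
    """Single-head self-attention over concatenated RGB and thermal tokens.

    Because the sequence mixes both modalities, the attention matrix
    contains intra-modality (RGB->RGB, T->T) and inter-modality
    (RGB->T, T->RGB) blocks, fusing both kinds of context in one pass.
    """
    x = np.concatenate([rgb_tokens, thermal_tokens], axis=0)   # (N_r + N_t, d)
    q, k, v = x @ wq, x @ wk, x @ wv                           # project tokens
    scores = q @ k.T / np.sqrt(k.shape[-1])                    # scaled dot-product
    attn = softmax(scores, axis=-1)                            # rows sum to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
d = 8
rgb = rng.standard_normal((4, d))       # 4 hypothetical RGB feature tokens
thermal = rng.standard_normal((4, d))   # 4 hypothetical thermal feature tokens
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

fused, attn = cross_modality_self_attention(rgb, thermal, wq, wk, wv)
print(fused.shape, attn.shape)  # fused tokens (8, 8); 8x8 attention with 4 modality blocks
```

The top-left and bottom-right 4×4 blocks of `attn` are the intra-modality interactions, while the off-diagonal blocks carry the cross-modality fusion that a separate per-modality CNN branch cannot express.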