Crowd counting research has advanced significantly in real-world applications, but it remains a formidable challenge in cross-modal settings. Most existing methods rely solely on the optical features of RGB images, overlooking the potential of other modalities such as thermal and depth images. The large inherent differences between modalities, together with the wide range of architectural design choices, make cross-modal crowd counting more challenging. In this paper, we propose Cross-modal Spatio-Channel Attention (CSCA) blocks, which can be easily integrated into any modality-specific architecture. A CSCA block first captures global functional correlations across modalities, at reduced overhead, through spatial-wise cross-modal attention. The spatially attended cross-modal features are then refined through adaptive channel-wise feature aggregation. In our experiments, the proposed block consistently yields significant performance improvements across various backbone networks, achieving state-of-the-art results in RGB-T and RGB-D crowd counting.
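To make the two stages described above concrete, the following is a minimal NumPy sketch of a cross-modal block, not the implementation evaluated in the paper. It assumes single feature maps of shape (C, H, W) per modality, plain dot-product attention over all spatial positions (without any of the overhead reduction mentioned above), and a softmax gate over globally pooled features as one possible form of adaptive channel-wise aggregation. All function names, parameter names, and weight shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_cross_attention(query_feat, other_feat, w_q, w_k, w_v):
    """Queries come from one modality, keys and values from the other.

    query_feat, other_feat: arrays of shape (C, H, W).
    w_q, w_k: (d, C) projections; w_v: (C, C) projection (assumed shapes).
    """
    c, h, w = query_feat.shape
    q = w_q @ query_feat.reshape(c, h * w)   # (d, N), N = H * W
    k = w_k @ other_feat.reshape(c, h * w)   # (d, N)
    v = w_v @ other_feat.reshape(c, h * w)   # (C, N)
    # attn[i, j]: weight that query position i places on key position j.
    attn = softmax(q.T @ k / np.sqrt(q.shape[0]), axis=-1)  # (N, N)
    out = v @ attn.T                          # (C, N)
    return out.reshape(c, h, w)


def channel_aggregation(own_feat, cross_feat, w_gate):
    """Blend a modality's own features with the cross-modal features.

    Per-channel weights are computed from globally pooled statistics and
    sum to 1 across the two sources (an assumed gating scheme).
    w_gate: (2C, 2C).
    """
    pooled = np.concatenate(
        [own_feat.mean(axis=(1, 2)), cross_feat.mean(axis=(1, 2))]
    )                                               # (2C,)
    logits = (w_gate @ pooled).reshape(2, -1)       # (2, C)
    weights = softmax(logits, axis=0)               # (2, C)
    return (weights[0][:, None, None] * own_feat
            + weights[1][:, None, None] * cross_feat)


def csca_block(rgb, aux, params):
    # Each modality attends to the other, then refines channel-wise.
    rgb_cross = spatial_cross_attention(
        rgb, aux, params["w_q"], params["w_k"], params["w_v"])
    aux_cross = spatial_cross_attention(
        aux, rgb, params["w_q"], params["w_k"], params["w_v"])
    rgb_out = channel_aggregation(rgb, rgb_cross, params["w_gate"])
    aux_out = channel_aggregation(aux, aux_cross, params["w_gate"])
    return rgb_out, aux_out


# Toy usage with random features and random (untrained) weights.
rng = np.random.default_rng(0)
C, H, W, D = 4, 3, 5, 2
params = {
    "w_q": rng.standard_normal((D, C)),
    "w_k": rng.standard_normal((D, C)),
    "w_v": rng.standard_normal((C, C)),
    "w_gate": rng.standard_normal((2 * C, 2 * C)),
}
rgb = rng.standard_normal((C, H, W))
aux = rng.standard_normal((C, H, W))   # e.g. thermal or depth features
rgb_out, aux_out = csca_block(rgb, aux, params)
```

Because the outputs keep the (C, H, W) shape of the inputs, a block of this kind can sit between corresponding stages of two modality-specific backbones without changing their layer dimensions, which is consistent with the claim that it integrates into any such architecture. Sharing one set of weights across both directions is a simplification made here for brevity.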