Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies exploit the long-range dependency modeling of self-attention in vision Transformers to re-activate semantic regions, aiming to avoid the partial activation seen in traditional class activation mapping (CAM). However, long-range modeling in the Transformer neglects the inherent spatial coherence of the object, and it usually diffuses the semantic-aware regions far from the object boundary, making the localization results significantly larger or far smaller than the object. To address this issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, which incorporates the semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter that dynamically adjusts the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computation cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. This enables the generated attention maps to capture sharper object boundaries and filter out object-irrelevant background areas. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both the CUB-200 and ImageNet-1K benchmarks. The code is available at https://github.com/164140757/SCM.
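
To make the idea of a unified diffusion over semantic and spatial relations more concrete, the following is a minimal NumPy sketch. It is not the SCM formulation from the paper or its repository: it swaps in a generic random-walk graph diffusion, and the names `grid_adjacency`, `calibrate`, `lam`, and `alpha` are hypothetical. Here `lam` is a fixed scalar standing in for the learnable parameter that balances semantic similarity against spatial context, and the grid is assumed to have at least two rows and two columns.

```python
import numpy as np


def grid_adjacency(h, w):
    """8-neighbour adjacency matrix for an h x w patch grid (no self-loops)."""
    ys, xs = np.divmod(np.arange(h * w), w)
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    adj = ((dy <= 1) & (dx <= 1)).astype(np.float64)
    np.fill_diagonal(adj, 0.0)
    return adj


def calibrate(tokens, coarse_map, h, w, lam=1.0, alpha=0.8):
    """Diffuse a coarse activation map over a patch graph.

    tokens:     (h*w, d) patch token embeddings
    coarse_map: (h*w,)   initial activation per patch
    lam:        weight of semantic similarity relative to pure spatial
                adjacency (an assumption; learnable in the paper, fixed here)
    alpha:      diffusion strength in [0, 1)
    """
    # Cosine similarity between patch tokens, keeping only positive values.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    semantic = np.clip(unit @ unit.T, 0.0, None)

    # Edges exist only between spatial neighbours; semantic similarity
    # strengthens them. With lam = 0 this is plain spatial smoothing.
    weights = grid_adjacency(h, w) * (1.0 + lam * semantic)

    # Row-normalise into a random-walk transition matrix.
    transition = weights / weights.sum(axis=1, keepdims=True)

    # Closed-form diffusion: solve (I - alpha * P) x = (1 - alpha) * x0.
    n = h * w
    return np.linalg.solve(np.eye(n) - alpha * transition,
                           (1.0 - alpha) * coarse_map)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))   # 4 x 4 grid of 8-dim tokens
    coarse = rng.random(16)             # stand-in for a coarse attention map
    refined = calibrate(tokens, coarse, 4, 4)
    print(refined.reshape(4, 4))
```

Because the transition matrix is row-stochastic and `alpha` is below 1, each refined value is a weighted average of the coarse activations, so the output stays within the range of the input map. In this sketch the module only post-processes a map; the abstract describes SCM as a training-time component whose effect is absorbed into the encoder, which the sketch does not attempt to reproduce.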
