Recently emerged weakly supervised object localization (WSOL) methods can learn to localize an object in an image using only image-level labels. Previous works endeavor to perceive the entire object from small and sparse discriminative attention maps, yet ignore the co-occurrence confounder (e.g., bird and sky), which makes it hard for model inspection (e.g., CAM) to distinguish the object from its context. In this paper, we make an early attempt to tackle this challenge via causal intervention (CI). Our proposed method, dubbed CI-CAM, explores the causalities among images, contexts, and categories to eliminate the biased co-occurrence in the class activation maps, thus improving the accuracy of object localization. Extensive experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning clear object boundaries from confounding contexts. In particular, on CUB-200-2011, which suffers severely from the co-occurrence confounder, CI-CAM significantly outperforms the traditional CAM-based baseline (58.39% vs. 52.4% in top-1 localization accuracy). In more general scenarios such as ImageNet, CI-CAM also performs on par with the state of the art.
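For readers unfamiliar with the CAM baseline that the abstract compares against, the following is a minimal sketch of a standard class activation map and a simplified box extraction. It is not the CI-CAM method itself, whose causal-intervention step is not specified in the abstract. The function names, the toy inputs, and the 0.5 threshold are illustrative assumptions, and real pipelines usually keep only the largest connected region before drawing the box.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]          # [row][col]
FeatureMaps = List[Grid]          # [channel][row][col]


def class_activation_map(features: FeatureMaps, class_weights: List[float]) -> Grid:
    """Standard CAM: channel-wise weighted sum of the last conv feature maps,
    using the classifier weights of the target class."""
    if len(features) != len(class_weights):
        raise ValueError("one weight per feature channel is required")
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(features, class_weights):
        for r in range(height):
            for c in range(width):
                cam[r][c] += weight * fmap[r][c]
    return cam


def normalize(cam: Grid) -> Grid:
    """Min-max scale the map to [0, 1]; a constant map becomes all zeros."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - lo) / span for v in row] for row in cam]


def bounding_box(cam: Grid, threshold: float = 0.5) -> Optional[Tuple[int, int, int, int]]:
    """Simplified localization: tightest (row_min, col_min, row_max, col_max)
    box around all cells at or above the threshold (assumed value, not from the paper)."""
    cells = [(r, c) for r, row in enumerate(cam) for c, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy input: channel 0 responds to the object, channel 1 to uniform context.
toy_features = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],
]
toy_cam = normalize(class_activation_map(toy_features, [2.0, 0.0]))
toy_box = bounding_box(toy_cam)
```

When a context channel (sky, in the bird example) receives a large class weight, the same weighted sum activates on background regions as well, which is the kind of co-occurrence bias the abstract describes.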