TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization

Weakly supervised object localization (WSOL) is the challenging problem of learning object localization models when only image category labels are given. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring the complete object extent, which causes the partial activation issue. In this paper, we argue that partial activation is caused by an intrinsic characteristic of CNNs: convolution operations produce local receptive fields and have difficulty capturing long-range feature dependencies among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in the visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produces attention maps of long-range visual dependency and avoids partial activation. TS-CAM then re-allocates category-related semantics to the patch tokens, making each of them aware of object categories. Finally, TS-CAM couples the semantic-aware patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.
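The final coupling step described above can be illustrated with a minimal NumPy sketch. The abstract does not specify the coupling operator or how attention is aggregated, so the choices here are assumptions made for illustration: the class-token attention is averaged over layers and heads, and coupling is an element-wise product with the per-class patch semantics. The function names, argument names, and tensor shapes are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def couple_attention_with_semantics(cls_attention, patch_semantics, grid_hw):
    """Couple a semantic-agnostic attention map with per-class patch semantics.

    cls_attention:   (num_layers, num_heads, num_patches) attention from the
                     class token to every patch token (assumed layout).
    patch_semantics: (num_patches, num_classes) class scores re-allocated to
                     each patch token (assumed layout).
    grid_hw:         (h, w) patch grid, with h * w == num_patches.
    Returns an array of shape (num_classes, h, w).
    """
    h, w = grid_hw
    num_patches = cls_attention.shape[-1]
    if h * w != num_patches or patch_semantics.shape[0] != num_patches:
        raise ValueError("patch grid does not match the number of patch tokens")

    # Assumption: aggregate attention by averaging over layers and heads.
    attention_map = cls_attention.mean(axis=(0, 1)).reshape(h, w)

    # One spatial semantic map per class.
    semantic_maps = patch_semantics.T.reshape(-1, h, w)

    # Assumption: coupling is an element-wise product, broadcast over classes.
    return semantic_maps * attention_map[None, :, :]


def min_max_normalize(localization_map, eps=1e-8):
    """Rescale a single map to [0, 1] so it can be thresholded."""
    low, high = localization_map.min(), localization_map.max()
    return (localization_map - low) / (high - low + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attention = rng.random((12, 6, 14 * 14))   # random stand-in values
    semantics = rng.random((14 * 14, 200))     # random stand-in values
    maps = couple_attention_with_semantics(attention, semantics, (14, 14))
    print(maps.shape)
```

In this sketch, the map for a predicted class would be normalized with `min_max_normalize` and thresholded to obtain an object region; the threshold and box extraction are left out because the abstract does not describe them.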
