Weakly supervised object localization (WSOL) aims to learn an object localizer using only image-level labels. Convolutional neural network (CNN) based techniques often highlight the most discriminative part of an object while ignoring its full extent. Recently, the transformer architecture has been applied to WSOL to capture long-range feature dependencies through its self-attention mechanism and multilayer perceptron structure. Nevertheless, transformers lack the locality inductive bias inherent to CNNs and may therefore lose local feature details in WSOL. In this paper, we propose a novel framework built upon the transformer, termed LCTR (Local Continuity TRansformer), which aims to enhance the local perception capability of global features among long-range feature dependencies. To this end, we propose a relational patch-attention module (RPAM), which considers cross-patch information on a global basis. We further design a cue digging module (CDM), which utilizes local features to guide the learning trend of the model toward highlighting weak local responses. Finally, comprehensive experiments are carried out on two widely used datasets, i.e., CUB-200-2011 and ILSVRC, to verify the effectiveness of our method.
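The abstract describes RPAM and CDM only at a high level, so the sketch below is purely illustrative: the class names, the relation-then-reweight scheme, and the thresholded boosting of weak responses are assumptions made for exposition, not the paper's actual implementation.

```python
# Minimal PyTorch-style sketch of the two ideas named in the abstract.
# Everything here (shapes, module names, fusion scheme) is assumed for
# illustration; it is NOT the LCTR authors' implementation.
import torch
import torch.nn as nn

class RelationalPatchAttention(nn.Module):
    """Hypothetical cross-patch reweighting: each patch token is rescaled
    by how strongly the other patches relate to it on a global basis."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):                        # tokens: (B, N, D) patch embeddings
        rel = torch.bmm(self.proj(tokens), tokens.transpose(1, 2))  # (B, N, N) pairwise relations
        attn = rel.softmax(dim=-1)                    # each patch's attention over all patches
        weight = attn.mean(dim=1).unsqueeze(-1)       # (B, N, 1) avg attention each patch receives
        return tokens * (1.0 + weight)                # emphasize globally consistent patches

class CueDigging(nn.Module):
    """Hypothetical local-cue amplification: weak but non-trivial responses
    are lifted toward the strong ones, so the map covers the full object."""
    def __init__(self, thresh=0.3):
        super().__init__()
        self.thresh = thresh

    def forward(self, act_map):                       # act_map: (B, N) non-negative patch scores
        norm = act_map / (act_map.amax(dim=1, keepdim=True) + 1e-6)
        weak = (norm > self.thresh) & (norm < 1.0)    # weak local responses above noise level
        return torch.where(weak, norm.sqrt(), norm)   # sqrt lifts weak responses toward 1

if __name__ == "__main__":
    B, N, D = 2, 196, 384                             # e.g. 14x14 patches from a small ViT
    tokens = RelationalPatchAttention(D)(torch.randn(B, N, D))
    scores = tokens.norm(dim=-1)                      # crude per-patch activation proxy
    print(CueDigging()(scores).shape)                 # torch.Size([2, 196])
```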