Weakly supervised semantic segmentation (WSSS), which aims to mine object regions using only class-level labels, is a challenging task in computer vision. Current state-of-the-art CNN-based methods usually adopt Class Activation Maps (CAMs) to highlight the potential areas of the object, but these maps often suffer from the part-activation issue, in which only a portion of the object is highlighted. To address this, we make an early attempt to explore the global feature attention mechanism of the vision transformer for the WSSS task. However, since the transformer lacks the inductive bias of CNN models, it cannot boost performance directly and may yield over-activated maps. To tackle these drawbacks, we propose a Convolutional Neural Networks Refined Transformer (CRT) that mines globally complete and locally accurate class activation maps. To validate the effectiveness of the proposed method, we conduct extensive experiments on the PASCAL VOC 2012 and CUB-200-2011 datasets. Experimental evaluations show that the proposed CRT achieves new state-of-the-art performance on both the weakly supervised semantic segmentation task and the weakly supervised object localization task, outperforming prior methods by a large margin.
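For readers unfamiliar with the Class Activation Maps mentioned in the abstract, the following is a minimal sketch of the standard CAM computation: a class-weighted sum of the final convolutional feature maps, clipped at zero and scaled to [0, 1]. It is not the CRT method itself. The function names, the nested-list data layout, and the toy values are illustrative assumptions.

```python
def class_activation_map(features, class_weights):
    """Standard CAM: weighted sum of K feature maps (each H x W nested lists).

    `class_weights` holds the K classifier weights for one class.
    Hypothetical helper for illustration; not taken from the paper.
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    # Clip negative evidence to zero, then scale by the peak value.
    cam = [[max(value, 0.0) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak == 0.0:
        return cam
    return [[value / peak for value in row] for row in cam]


def activated_region(cam, threshold=0.5):
    """Binary mask of locations whose normalized activation meets the threshold."""
    return [[value >= threshold for value in row] for row in cam]


# Toy input: two 2x2 feature maps, each firing on a different object part.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
cam = class_activation_map(features, class_weights=[2.0, 0.5])
mask = activated_region(cam, threshold=0.6)
```

In this toy case the second part receives a weaker normalized response and falls below the threshold, which is a small-scale picture of the part-activation issue: the mask covers only the most discriminative location rather than the whole object.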