Weakly-supervised semantic segmentation (WSSS) with image-level labels is an important and challenging task. Due to their high training efficiency, end-to-end solutions for WSSS have received increasing attention from the community. However, current methods are mainly based on convolutional neural networks and fail to explore global information properly, thus usually resulting in incomplete object regions. In this paper, to address the aforementioned problem, we introduce Transformers, which naturally integrate global information, to generate more integral initial pseudo labels for end-to-end WSSS. Motivated by the inherent consistency between the self-attention in Transformers and semantic affinity, we propose an Affinity from Attention (AFA) module to learn semantic affinity from the multi-head self-attention (MHSA) in Transformers. The learned affinity is then leveraged to refine the initial pseudo labels for segmentation. In addition, to efficiently derive reliable affinity labels for supervising AFA and to ensure the local consistency of pseudo labels, we devise a Pixel-Adaptive Refinement module that incorporates low-level image appearance information to refine the pseudo labels. We perform extensive experiments, and our method achieves 66.0% and 38.9% mIoU on the PASCAL VOC 2012 and MS COCO 2014 datasets, respectively, significantly outperforming recent end-to-end methods and several multi-stage competitors. Code is available at https://github.com/rulixiang/afa.
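The core idea of deriving affinity from attention and using it to refine pseudo labels can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes an attention tensor of shape `(heads, N, N)` over `N` tokens and per-token class scores of shape `(N, C)`, symmetrizes the head-averaged attention into an affinity matrix, and propagates the scores with a simple random walk; function names (`affinity_from_attention`, `refine_pseudo_labels`) and the number of propagation steps are assumptions.

```python
import numpy as np

def affinity_from_attention(attn):
    """Derive a symmetric token-affinity matrix from MHSA maps.

    attn: array of shape (heads, N, N), multi-head self-attention.
    Attention is directed, but semantic affinity between two tokens
    should be symmetric, so we average over heads and symmetrize.
    """
    A = attn.mean(axis=0)          # (N, N), head-averaged attention
    return (A + A.T) / 2.0         # symmetric affinity

def refine_pseudo_labels(cam, affinity, n_iters=2):
    """Refine initial per-token class scores via affinity propagation.

    cam: array of shape (N, C), initial pseudo-label scores.
    Row-normalizes the affinity into a transition matrix and applies
    a short random walk so scores diffuse between affine tokens.
    """
    T = affinity / (affinity.sum(axis=1, keepdims=True) + 1e-8)
    out = cam.copy()
    for _ in range(n_iters):
        out = T @ out              # one propagation step
    return out

# Toy usage on random data (shapes only; no real model involved).
rng = np.random.default_rng(0)
attn = rng.random((4, 6, 6))       # 4 heads, 6 tokens
S = affinity_from_attention(attn)
cam = rng.random((6, 3))           # 6 tokens, 3 classes
refined = refine_pseudo_labels(cam, S)
```

Because each propagation step is a convex combination of the input scores, the refinement smooths the pseudo labels toward regions the attention deems mutually affine without pushing scores outside their original range.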