Weakly supervised instance segmentation with image-level labels, instead of expensive pixel-level masks, remains unexplored. In this paper, we tackle this challenging problem by exploiting class peak responses to enable a classification network to extract instance masks. With only image-level label supervision, CNN classifiers applied in a fully convolutional manner can produce class response maps, which specify the classification confidence at each image location. We observe that local maxima, i.e., peaks, in a class response map typically correspond to strong visual cues residing inside each instance. Motivated by this, we first design a process that stimulates peaks to emerge from a class response map. The emerged peaks are then back-propagated and effectively mapped to highly informative regions of each object instance, such as instance boundaries. We refer to the maps generated from class peak responses as Peak Response Maps (PRMs). PRMs provide a finely detailed instance-level representation, which allows instance masks to be extracted even with off-the-shelf methods. To the best of our knowledge, we are the first to report results for the challenging task of image-level supervised instance segmentation. Extensive experiments show that our method also improves weakly supervised pointwise localization and semantic segmentation, and achieves state-of-the-art results on popular benchmarks, including PASCAL VOC 2012 and MS COCO.
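As an illustration of the notion of peaks used above, the following is a minimal sketch of locating local maxima in a single 2D class response map. It is not the peak stimulation process of the paper, which operates inside the network; the function name `find_peaks`, the 3x3 window, and the optional `threshold` are assumptions made for this example only.

```python
import numpy as np


def find_peaks(response_map, window=3, threshold=None):
    """Return (row, col) positions of local maxima in a 2D class response map.

    Hypothetical helper for illustration: a location counts as a peak if its
    value equals the maximum of the surrounding window x window neighborhood
    and, when a threshold is given, also exceeds that threshold.
    """
    if window % 2 != 1:
        raise ValueError("window must be odd")
    response_map = np.asarray(response_map, dtype=float)
    h, w = response_map.shape
    pad = window // 2
    # Pad with -inf so border locations are compared only with real neighbors.
    padded = np.pad(response_map, pad, mode="constant", constant_values=-np.inf)
    # Elementwise maximum over all shifted views gives the neighborhood maximum.
    neighborhood_max = np.full((h, w), -np.inf)
    for dy in range(window):
        for dx in range(window):
            neighborhood_max = np.maximum(
                neighborhood_max, padded[dy:dy + h, dx:dx + w]
            )
    is_peak = response_map == neighborhood_max
    if threshold is not None:
        is_peak &= response_map > threshold
    # argwhere returns positions in row-major order.
    return [(int(r), int(c)) for r, c in np.argwhere(is_peak)]


# Toy response map with two separated responses of different strength.
toy_map = np.zeros((5, 5))
toy_map[1, 1] = 2.0
toy_map[3, 3] = 1.0
print(find_peaks(toy_map, window=3, threshold=0.5))
```

Note that without a threshold, every location in a flat region ties with its neighborhood maximum and is reported as a peak, so a threshold (or a tie-breaking rule) is needed in practice. Each detected peak would then serve as the starting point for the back-propagation step that yields one Peak Response Map per peak.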