Classification networks can be used to localize and segment objects in images by means of class activation maps (CAMs). However, without pixel-level annotations, classification networks are known to (1) focus mainly on discriminative regions, and (2) produce diffuse CAMs without well-defined prediction contours. In this work, we address both problems with two contributions that improve CAM learning. First, we incorporate importance sampling based on the class-wise probability mass function induced by the CAMs to produce stochastic image-level class predictions. This results in CAMs that activate over a larger extent of each object. Second, we formulate a feature similarity loss term that aims to match the prediction contours with edges in the image. As a third contribution, we conduct experiments on the PASCAL VOC 2012 benchmark dataset to demonstrate that these modifications significantly improve contour accuracy, while remaining comparable to current state-of-the-art methods in terms of region similarity.
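To make the first contribution concrete, the following is a minimal sketch of how a stochastic image-level prediction could be drawn from a CAM. It is an illustration under stated assumptions, not the authors' implementation: the function name `sample_image_level_predictions`, the use of a per-pixel sigmoid, and the choice to read the prediction off the sampled pixel are all hypothetical, since the abstract does not specify these details.

```python
import numpy as np


def sample_image_level_predictions(cam_logits, rng):
    """Draw one stochastic image-level prediction per class from CAM logits.

    cam_logits: array of shape (C, H, W) with raw per-pixel class scores.
    rng: a numpy.random.Generator.
    Returns (preds, pmf, idx): sampled predictions of shape (C,), the
    class-wise PMF over pixels of shape (C, H*W), and the sampled flat
    pixel indices of shape (C,).
    """
    num_classes, height, width = cam_logits.shape

    # Assumption: per-pixel class probabilities come from a sigmoid.
    probs = 1.0 / (1.0 + np.exp(-cam_logits))
    flat = probs.reshape(num_classes, height * width)

    # Normalise each class map so it sums to one: this is the class-wise
    # probability mass function induced by the CAM.
    pmf = flat / flat.sum(axis=1, keepdims=True)

    # Sample one pixel location per class according to that PMF.
    idx = np.array(
        [rng.choice(height * width, p=pmf[c]) for c in range(num_classes)]
    )

    # Assumption: the image-level prediction for a class is the per-pixel
    # probability at the sampled location.
    preds = flat[np.arange(num_classes), idx]
    return preds, pmf, idx
```

Under these assumptions, a classification loss applied to `preds` would reach pixels beyond the single most discriminative location, because any pixel with non-zero mass can be sampled. This is one plausible reading of why sampling encourages CAMs to cover a larger extent of the object, in contrast to global max pooling, which always selects the peak.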