We consider the task of learning a classifier for semantic segmentation using weak supervision in the form of image-level labels that specify the object classes present in an image. Our method uses deep convolutional neural networks (CNNs) and adopts an Expectation-Maximization (EM) based approach. We focus on three aspects of EM: (i) initialization; (ii) latent posterior estimation (the E-step); and (iii) the parameter update (the M-step). We show that saliency and attention maps (our bottom-up and top-down cues, respectively) of simple images provide strong signals for learning an initialization for the EM-based algorithm. Intuitively, rather than attempting to segment complex images from the outset, it is far easier and highly effective to first learn to segment a set of simple images and then progress to the complex ones. Next, to update the parameters, we propose minimizing a combination of the standard softmax loss and the KL divergence between the true latent posterior and the likelihood given by the CNN. We argue that this combination is more robust to incorrect predictions made in the E-step of the EM method, and we support this argument with empirical and visual results. Extensive experiments and discussions show that: (i) our method is simple and intuitive; (ii) it requires only image-level labels; and (iii) it consistently outperforms other state-of-the-art weakly-supervised methods by a large margin on the PASCAL VOC 2012 dataset.
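To make the proposed M-step objective concrete, the following is a minimal sketch of the combined loss in PyTorch. It is an illustration under stated assumptions, not the paper's exact formulation: the function name m_step_loss, the weighting factor alpha, and the use of hard argmax labels for the softmax term are all hypothetical choices made here for clarity.

```python
import torch
import torch.nn.functional as F

def m_step_loss(logits, posterior, alpha=1.0):
    """Sketch of an M-step objective combining the standard softmax
    cross-entropy with a KL term pulling the CNN likelihood toward the
    latent posterior estimated in the E-step.

    logits:    (N, C, H, W) raw per-pixel CNN scores
    posterior: (N, C, H, W) latent posterior from the E-step (probabilities)
    alpha:     illustrative weight balancing the two terms (not from the paper)
    """
    # Hard labels derived from the E-step posterior (an assumption here;
    # the paper may define the softmax term differently).
    labels = posterior.argmax(dim=1)

    # Standard per-pixel softmax cross-entropy against the hard labels.
    ce = F.cross_entropy(logits, labels)

    # KL(posterior || softmax(logits)), i.e. divergence between the
    # estimated latent posterior and the likelihood given by the CNN.
    log_p = F.log_softmax(logits, dim=1)
    kl = F.kl_div(log_p, posterior, reduction="batchmean")

    return ce + alpha * kl
```

The intuition behind the combination is that the cross-entropy term alone fully trusts the (possibly wrong) E-step labels, whereas the KL term matches the full posterior distribution, softening the penalty when the E-step is uncertain or mistaken.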