Visual grounding localizes regions (boxes or segments) in the image corresponding to given referring expressions. In this work we address image segmentation from referring expressions, a problem that has so far only been studied in a fully-supervised setting. A fully-supervised setup, however, requires pixel-wise supervision and is hard to scale given the expense of manual annotation. We therefore introduce a new task of weakly-supervised image segmentation from referring expressions and propose Text grounded semantic SEGmentation (TSEG) that learns segmentation masks directly from image-level referring expressions without pixel-level annotations. Our transformer-based method computes patch-text similarities and guides the classification objective during training with a new multi-label patch assignment mechanism. The resulting visual grounding model segments image regions corresponding to given natural language expressions. Our approach TSEG demonstrates promising results for weakly-supervised referring expression segmentation on the challenging PhraseCut and RefCOCO datasets. TSEG also shows competitive performance when evaluated in a zero-shot setting for semantic segmentation on Pascal VOC.
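To make the patch-text similarity idea concrete, the sketch below shows one plausible way to compute per-expression similarity maps and pool them into image-level scores for a multi-label objective. It is a minimal illustration under our own assumptions (tensor shapes, softmax-weighted pooling, a binary cross-entropy loss), not the exact multi-label patch assignment mechanism of TSEG.

```python
import torch
import torch.nn.functional as F

def patch_text_scores(patch_emb, text_emb, tau=0.07):
    """Hypothetical sketch of patch-text grounding.

    patch_emb: (B, N, D) patch embeddings from a vision transformer
    text_emb:  (B, K, D) embeddings of K referring expressions per image
    Returns image-level scores (B, K) and patch-text similarity maps (B, K, N).
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every patch and every expression (one map per expression).
    sim = torch.einsum("bnd,bkd->bkn", patch_emb, text_emb) / tau
    # Soft assignment of patches to expressions: each expression pools evidence
    # mostly from its highest-scoring patches (assumed pooling, not the paper's exact rule).
    weights = sim.softmax(dim=-1)
    scores = (weights * sim).sum(dim=-1)  # (B, K) image-level logits
    return scores, sim

# Weak supervision would then apply a multi-label loss on image-level labels only, e.g.:
# loss = F.binary_cross_entropy_with_logits(scores, labels)
# At test time, the per-expression similarity maps `sim` can be thresholded into masks.
```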