Referring Image Segmentation (RIS) aims to segment from an image the target object referred to by a given natural language expression. The diverse and flexible expressions, together with the complex visual content of images, place high demands on RIS models to capture fine-grained matching behaviors between the words in an expression and the objects presented in an image. However, such matching behaviors are hard to learn and capture when the visual cues of referents (i.e., referred objects) are insufficient: referents with weak visual cues are easily confused with cluttered background at their boundaries, or even overwhelmed by salient objects in the image. Moreover, this insufficient-visual-cue issue cannot be handled by the cross-modal fusion mechanisms adopted in previous work. In this paper, we tackle the problem from a novel perspective of enhancing the visual information of the referents by devising a Two-stage Visual cues enhancement Network (TV-Net), in which a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Through this two-stage enhancement, the proposed TV-Net learns fine-grained matching behaviors between the natural language expression and the image more effectively, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments validate the effectiveness of the proposed method on the RIS task, with TV-Net surpassing state-of-the-art approaches on four benchmark datasets.
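The abstract does not specify the internals of the AMF module, but the general idea of adaptive multi-resolution feature fusion can be sketched in a few lines. The following is a minimal, hypothetical PyTorch sketch (the class name, the per-pixel softmax weighting, and all shapes are our assumptions, not the authors' implementation): features from several backbone resolutions are resized to a common scale, per-level weight maps are predicted from the features themselves, and the weighted sum is returned.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMultiResFusion(nn.Module):
    """Hypothetical sketch of adaptive multi-resolution feature fusion,
    not the paper's exact AMF module."""

    def __init__(self, in_channels, num_levels):
        super().__init__()
        # 1x1 conv predicting one scalar weight map per feature level
        self.weight_pred = nn.Conv2d(in_channels * num_levels, num_levels, kernel_size=1)

    def forward(self, feats):
        # feats: list of tensors [B, C, H_i, W_i] at different resolutions
        target_size = feats[0].shape[-2:]
        # Resize every level to the highest resolution
        resized = [F.interpolate(f, size=target_size, mode="bilinear",
                                 align_corners=False) for f in feats]
        stacked = torch.cat(resized, dim=1)                         # [B, C*L, H, W]
        weights = torch.softmax(self.weight_pred(stacked), dim=1)   # [B, L, H, W]
        # Weighted sum over levels; weights broadcast over the channel dim
        fused = sum(w.unsqueeze(1) * f
                    for w, f in zip(weights.unbind(dim=1), resized))
        return fused                                                # [B, C, H, W]

if __name__ == "__main__":
    amf = AdaptiveMultiResFusion(in_channels=64, num_levels=3)
    feats = [torch.randn(2, 64, s, s) for s in (56, 28, 14)]
    print(amf(feats).shape)  # torch.Size([2, 64, 56, 56])

The design choice illustrated here, predicting fusion weights from the features rather than fixing them, is what makes the fusion "adaptive": levels that carry stronger evidence for a weakly-visible referent can dominate at the pixels where they help most.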