We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a given natural language description. Addressing RIS effectively requires considering the interactions across the visual and linguistic modalities as well as the interactions within each modality. Existing methods are limited because they either compute the different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, showing considerable performance gains over existing state-of-the-art (SOTA) methods.
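To make the core idea concrete, below is a minimal sketch (not the paper's actual implementation) of how a single joint self-attention pass can realize all three interactions, intra-visual, intra-linguistic, and cross-modal, in one step rather than sequentially. The module name, dimensions, and the use of `nn.MultiheadAttention` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SynchronousFusionSketch(nn.Module):
    """Hypothetical sketch of synchronous multi-modal fusion: one
    self-attention pass over the concatenated visual and linguistic
    tokens computes vision-vision, language-language, and
    vision-language interactions simultaneously."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor, lang_tokens: torch.Tensor):
        # vis_tokens:  (B, H*W, C) flattened visual feature map
        # lang_tokens: (B, T, C)   word-level linguistic features
        x = torch.cat([vis_tokens, lang_tokens], dim=1)
        # All pairwise token interactions happen in this one attention step,
        # so no modality's update depends on a previously fused result.
        fused, _ = self.attn(x, x, x)
        n_vis = vis_tokens.size(1)
        return fused[:, :n_vis], fused[:, n_vis:]
```

Because every token attends to every other token in a single operation, errors cannot propagate from one fusion stage to the next, which is the failure mode the abstract attributes to sequential approaches.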