We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a given natural language description. Addressing RIS effectively requires considering the interactions happening \emph{across} visual and linguistic modalities and the interactions \emph{within} each modality. Existing methods are limited because they either compute the different forms of interactions \emph{sequentially} (leading to error propagation) or \emph{ignore} intra-modal interactions. We address this limitation by performing all three interactions \emph{simultaneously} through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach on four benchmark datasets, showing considerable performance gains over existing state-of-the-art (SOTA) methods.
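To make the notion of \emph{simultaneous} intra- and cross-modal interaction concrete, the following is a minimal sketch (not the authors' SFM; the module name \texttt{JointFusionSketch}, dimensions, and token shapes are assumptions) of how a single joint self-attention pass over concatenated visual and word tokens computes vision-vision, language-language, and vision-language interactions in one step rather than sequentially.

\begin{verbatim}
# Minimal sketch, NOT the paper's SFM: joint self-attention over the
# concatenation of visual and linguistic tokens, so intra-modal and
# cross-modal interactions are computed simultaneously in one pass.
import torch
import torch.nn as nn

class JointFusionSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, word_tokens):
        # visual_tokens: (B, H*W, dim), word_tokens: (B, T, dim)
        tokens = torch.cat([visual_tokens, word_tokens], dim=1)
        # one attention step lets every token attend to every other token,
        # covering all three interaction types at once
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(tokens + fused)
        n_vis = visual_tokens.size(1)
        # the visual part can be reshaped back into a feature map
        return fused[:, :n_vis], fused[:, n_vis:]

if __name__ == "__main__":
    v = torch.randn(2, 14 * 14, 256)  # flattened 14x14 visual feature map
    w = torch.randn(2, 20, 256)       # 20 word embeddings in the same dim
    vis_out, lang_out = JointFusionSketch()(v, w)
    print(vis_out.shape, lang_out.shape)
\end{verbatim}

This sketch only illustrates why a joint formulation avoids the error propagation of sequential fusion; the actual SFM and HCAM designs are described in the body of the paper.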