In this work, we address the task of referring image segmentation (RIS), which aims to predict a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or bidirectional relationships between visual and linguistic features to associate the two modalities, while multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial for localizing and segmenting objects with large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch progressively integrates multi-scale contextual information and facilitates the alignment of the two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
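To make the described design more concrete, the following is a minimal PyTorch sketch of the CMF idea: visual features are fused with a tiled linguistic vector, passed through parallel atrous (dilated) convolutions, and a cascaded branch progressively accumulates the multi-scale outputs. The class name, channel sizes, and dilation rates are illustrative assumptions rather than the authors' exact configuration; see the released code at the URL above for the actual implementation.

```python
# Minimal sketch of cascaded multi-modal fusion (illustrative, not the official code).
import torch
import torch.nn as nn


class CascadedMultiModalFusion(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=512, out_dim=512, rates=(1, 6, 12, 18)):
        super().__init__()
        # Project concatenated visual + linguistic features to a common dimension.
        self.fuse = nn.Conv2d(vis_dim + lang_dim, out_dim, kernel_size=1)
        # Parallel atrous convolutions with increasing dilation rates.
        self.branches = nn.ModuleList(
            nn.Conv2d(out_dim, out_dim, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.out = nn.Conv2d(out_dim * len(rates), out_dim, kernel_size=1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, C_v, H, W) feature map; lang_feat: (B, C_l) sentence embedding.
        B, _, H, W = vis_feat.shape
        lang_map = lang_feat.view(B, -1, 1, 1).expand(-1, -1, H, W)
        x = torch.relu(self.fuse(torch.cat([vis_feat, lang_map], dim=1)))

        # Cascaded branch: each atrous branch also receives the previous branch's
        # output, so multi-scale context is integrated progressively.
        outputs, prev = [], 0
        for branch in self.branches:
            cur = torch.relu(branch(x + prev))
            outputs.append(cur)
            prev = cur
        return self.out(torch.cat(outputs, dim=1))


if __name__ == "__main__":
    cmf = CascadedMultiModalFusion()
    vis = torch.randn(2, 512, 26, 26)   # backbone feature map
    lang = torch.randn(2, 512)          # pooled language embedding
    print(cmf(vis, lang).shape)         # -> torch.Size([2, 512, 26, 26])
```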