We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to a given natural language description. To solve RIS effectively, we need to understand the relationship of each word to the other words, of each image region to the other regions, and the cross-modal alignment between the linguistic and visual domains. We argue that one of the limiting factors in recent methods is that they do not handle these interactions simultaneously. To this end, we propose a novel architecture called JRNet, which uses a Joint Reasoning Module (JRM) to concurrently capture the inter-modal and intra-modal interactions. The output of the JRM is passed through a novel Cross-Modal Multi-Level Fusion (CMMLF) module, which further refines the segmentation masks by exchanging contextual information across the visual hierarchy, with linguistic features acting as a bridge. We present thorough ablation studies and validate our approach on four benchmark datasets, showing considerable gains over existing state-of-the-art methods.
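To make the described architecture concrete, below is a minimal PyTorch sketch of how a joint-reasoning and multi-level-fusion pipeline of this kind could be structured. The class names, the attention-based joint reasoning, and the gating-style fusion are illustrative assumptions inferred from the abstract alone, not the authors' implementation.

```python
# Illustrative sketch only: module names, shapes, and design details are
# assumptions based on the abstract, not the authors' released code.
import torch
import torch.nn as nn


class JointReasoningModule(nn.Module):
    """Concatenates visual and word tokens and applies self-attention, so
    intra-modal (word-word, region-region) and inter-modal (word-region)
    interactions are captured in a single pass."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis: (B, HW, C) flattened region features; lang: (B, T, C) word features
        tokens = torch.cat([vis, lang], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)
        # Split back into the visual and linguistic streams.
        return tokens[:, : vis.size(1)], tokens[:, vis.size(1):]


class CMMLF(nn.Module):
    """Cross-Modal Multi-Level Fusion: a pooled linguistic vector acts as a
    bridge, re-weighting the visual features at each pyramid level before
    they are merged into a single map for mask prediction."""

    def __init__(self, dim: int, num_levels: int):
        super().__init__()
        self.gates = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_levels))
        self.proj = nn.Conv2d(num_levels * dim, dim, kernel_size=1)

    def forward(self, levels: list, lang: torch.Tensor) -> torch.Tensor:
        # levels: per-level visual maps (B, C, H, W), assumed resized to a
        # common spatial resolution; lang: (B, T, C) word features.
        bridge = lang.mean(dim=1)  # (B, C) sentence-level summary
        fused = [
            feat * torch.sigmoid(g(bridge))[:, :, None, None]
            for feat, g in zip(levels, self.gates)
        ]
        return self.proj(torch.cat(fused, dim=1))
```

One appeal of treating both modalities as a single token sequence, as in the sketch above, is that a single attention operation covers all four interaction types (word-word, region-region, word-region, region-word) at once, which matches the abstract's emphasis on handling intra- and inter-modal interactions simultaneously.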