Recently, referring image segmentation has attracted widespread interest. Previous methods perform multi-modal fusion between language and vision at the decoding side of the network, where the linguistic features interact with the visual features of each scale separately; this ignores the continuous guidance of language over multi-scale visual features. In this work, we propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network and uses language to refine the multi-modal features progressively. Moreover, a co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features, which promotes the consistency of the cross-modal information representation in the semantic space. Finally, we propose a boundary enhancement module (BEM) to make the network pay more attention to fine structures. Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance under different evaluation metrics without any post-processing.
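To make the parallel-update idea concrete, below is a minimal PyTorch sketch of one co-attention step over flattened visual tokens and word-level linguistic features. The projection layout, residual updates, and all names (`CoAttention`, `proj_v`, `proj_l`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Sketch of co-attention: both modalities are updated in parallel
    from a shared affinity matrix (an assumed design, for illustration)."""
    def __init__(self, dim):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)  # projects visual tokens
        self.proj_l = nn.Linear(dim, dim)  # projects linguistic tokens

    def forward(self, vis, lang):
        # vis:  (B, N, C) flattened visual features (N = H * W)
        # lang: (B, T, C) word-level linguistic features
        affinity = self.proj_v(vis) @ self.proj_l(lang).transpose(1, 2)  # (B, N, T)
        # Row-wise softmax: each visual token attends to the words;
        # column-wise softmax: each word attends to the visual tokens.
        v2l = affinity.softmax(dim=-1)                  # (B, N, T)
        l2v = affinity.softmax(dim=1).transpose(1, 2)   # (B, T, N)
        vis_new = vis + v2l @ lang    # language-refined visual features
        lang_new = lang + l2v @ vis   # vision-refined linguistic features
        return vis_new, lang_new
```

Because both updates are computed from the same affinity matrix before either modality is overwritten, the visual and linguistic representations evolve in step, which is one plausible way to encourage consistent cross-modal semantics.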