Cross-Modal Progressive Comprehension for Referring Segmentation

Given a natural language expression and an image or video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively, based on the informative words in the expression: they first roughly locate candidate entities and then distinguish the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme that mimics this behavior, and we implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that the expression might refer to. Then, relational words are adopted to highlight the target entity and suppress irrelevant ones through spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I and further exploits action words to highlight the entity that matches the action cues through temporal graph reasoning. In addition to CMPC, we introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module that integrates the reasoned multimodal features from different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms our image or video referring segmentation framework, respectively. These frameworks achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks.
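To make the two-stage idea concrete, the following is a minimal toy sketch of progressive comprehension on image regions. It is not the CMPC-I implementation: the function names, the sigmoid gating, the softmax-normalized edges, and the hand-assigned word kinds are all assumptions chosen for illustration, and plain Python lists stand in for learned feature maps. Stage 1 gates each region by how well it matches the entity and attribute words; stage 2 builds region-to-region edges routed through the relational words and performs one round of message passing.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entity_perception(regions, words, kinds):
    """Stage 1 (assumed form): gate each region by its match to entity/attribute words."""
    context = mean([w for w, k in zip(words, kinds) if k in ("entity", "attribute")])
    gated = []
    for region in regions:
        gate = 1.0 / (1.0 + math.exp(-dot(region, context)))  # sigmoid gate in (0, 1)
        gated.append([gate * x for x in region])
    return gated

def relation_adjacency(regions, words, kinds):
    """Build region-to-region edge weights routed through the relational words."""
    relational = [w for w, k in zip(words, kinds) if k == "relation"]
    # Region -> word affinities, normalized over the relational words.
    to_word = [softmax([dot(r, w) for w in relational]) for r in regions]
    # Word -> region affinities, normalized over the regions.
    to_region = [softmax([dot(w, r) for r in regions]) for w in relational]
    n = len(regions)
    return [[sum(to_word[i][k] * to_region[k][j] for k in range(len(relational)))
             for j in range(n)] for i in range(n)]

def graph_reasoning(regions, adjacency):
    """Stage 2 (assumed form): one message-passing step with a residual connection."""
    refined = []
    for i, region in enumerate(regions):
        message = [sum(adjacency[i][j] * regions[j][d] for j in range(len(regions)))
                   for d in range(len(region))]
        refined.append([x + m for x, m in zip(region, message)])
    return refined

def progressive_comprehension(regions, words, kinds):
    gated = entity_perception(regions, words, kinds)
    adjacency = relation_adjacency(gated, words, kinds)
    return graph_reasoning(gated, adjacency), adjacency

# Hypothetical toy inputs: three region features and four word features.
regions = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.0, 0.1, 1.0]]
words = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8], [0.3, 0.7, 0.1], [0.5, 0.5, 0.0]]
kinds = ["entity", "relation", "attribute", "relation"]
refined, adjacency = progressive_comprehension(regions, words, kinds)
```

Because both affinity tables are softmax-normalized, each row of `adjacency` sums to 1, so every region receives a convex combination of the gated region features as its message. A real model would learn the projections, operate on dense feature maps, and classify the word kinds from the expression rather than taking them as input.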
