Amodal Instance Segmentation (AIS) aims to segment the regions of both visible and possibly occluded parts of an object instance. While Mask R-CNN-based AIS approaches have shown promising results, they are unable to model the coherence among high-level features due to their limited receptive field. Recent transformer-based models show impressive performance on vision tasks, even surpassing Convolutional Neural Networks (CNNs). In this work, we present AISFormer, an AIS framework with a Transformer-based mask head. AISFormer explicitly models the complex coherence between the occluder, visible, amodal, and invisible masks within an object's regions of interest by treating them as learnable queries. Specifically, AISFormer contains four modules: (i) feature encoding: extract ROI features and learn both short-range and long-range visual dependencies; (ii) mask transformer decoding: generate the occluder, visible, and amodal mask query embeddings with a transformer decoder; (iii) invisible mask embedding: model the coherence between the amodal and visible masks; and (iv) mask predicting: estimate the output masks, including occluder, visible, amodal, and invisible. We conduct extensive experiments and ablation studies on three challenging benchmarks, i.e., KINS, D2SA, and COCOA-cls, to evaluate the effectiveness of AISFormer. The code is available at: https://github.com/UARK-AICV/AISFormer
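To make the four-module pipeline concrete, the following is a minimal PyTorch-style sketch of such a Transformer-based mask head. It is an illustration under stated assumptions, not the authors' implementation: the class name AISFormerMaskHead, the invisible_mlp helper, and all layer counts and dimensions are hypothetical; only the overall structure (learnable occluder/visible/amodal queries decoded against encoded ROI tokens, an invisible embedding derived from the amodal and visible embeddings, and dot-product mask prediction) follows the abstract.

```python
import torch
import torch.nn as nn


class AISFormerMaskHead(nn.Module):
    """Sketch of a Transformer-based amodal mask head (illustrative only).

    Three learnable queries (occluder, visible, amodal) attend to encoded
    ROI tokens via a transformer decoder; an invisible-mask embedding is
    derived from the amodal and visible query embeddings. Names and sizes
    are assumptions, not the released AISFormer code.
    """

    def __init__(self, d_model=256, num_layers=2, nhead=8):
        super().__init__()
        # (i) feature encoding: a light transformer encoder over ROI tokens
        # captures both short-range and long-range visual dependencies.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # (ii) mask transformer decoding: occluder/visible/amodal as queries.
        self.queries = nn.Embedding(3, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # (iii) invisible mask embedding: combine amodal and visible
        # embeddings to model their coherence (hypothetical MLP).
        self.invisible_mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, roi_feat):
        # roi_feat: (B, C, H, W) ROI features from the detector backbone.
        B, C, H, W = roi_feat.shape
        tokens = roi_feat.flatten(2).transpose(1, 2)           # (B, HW, C)
        mem = self.encoder(tokens)                             # encoded ROI tokens
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)  # (B, 3, C)
        q = self.decoder(q, mem)                               # decoded queries
        occluder_q, visible_q, amodal_q = q.unbind(dim=1)
        invisible_q = self.invisible_mlp(
            torch.cat([amodal_q, visible_q], dim=-1)
        )
        # (iv) mask predicting: dot-product each embedding with pixel features.
        pix = mem.transpose(1, 2).reshape(B, C, H, W)
        embs = torch.stack([occluder_q, visible_q, amodal_q, invisible_q], dim=1)
        masks = torch.einsum("bqc,bchw->bqhw", embs, pix)      # (B, 4, H, W)
        return masks
```

For example, under these assumptions, AISFormerMaskHead()(torch.randn(2, 256, 14, 14)) yields a (2, 4, 14, 14) tensor holding the occluder, visible, amodal, and invisible mask logits for each ROI.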