Recent transformer-based models have shown impressive performance on vision tasks, even outperforming Convolutional Neural Networks (CNNs). In this work, we present a novel, flexible, and effective transformer-based model for high-quality instance segmentation. The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline, building on an alternative CNN backbone appended with two parallel subtasks: (1) predicting per-instance categories via the transformer and (2) dynamically generating segmentation masks with a multi-level upsampling module. SOTR can effectively extract lower-level feature representations and capture long-range context dependencies through the Feature Pyramid Network (FPN) and the twin transformer, respectively. Meanwhile, compared with the original transformer, the proposed twin transformer is time- and resource-efficient since only row and column attention are involved in encoding pixels. Moreover, SOTR is easy to incorporate with various CNN backbones and transformer model variants, yielding considerable improvements in segmentation accuracy and training convergence. Extensive experiments show that SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches. We hope our simple but strong framework can serve as a preferred baseline for instance-level recognition. Our code is available at https://github.com/easton-cau/SOTR.
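The efficiency claim above rests on factorizing full 2D self-attention into sequential row and column attention: each pixel attends only to the W pixels in its row and the H pixels in its column, giving roughly O(HW(H+W)) work instead of O((HW)²) for attention over all HW pixels. The following is a minimal conceptual sketch of that factorization, not the authors' implementation; it uses single-head attention with identity query/key/value projections (an assumption for brevity), whereas the actual model uses learned multi-head projections.

```python
import numpy as np

def self_attention(x):
    """Single-head scaled dot-product self-attention over a sequence.

    x: (seq_len, d) array. Identity Q/K/V projections for simplicity.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                     # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ x

def twin_attention(feat):
    """Row attention followed by column attention on a feature map.

    feat: (H, W, d) array. Each pixel mixes information along its row,
    then along its column, instead of attending to all H*W positions.
    """
    H, W, _ = feat.shape
    # Row attention: each of the H rows is a length-W sequence.
    rows = np.stack([self_attention(feat[i]) for i in range(H)], axis=0)
    # Column attention: each of the W columns is a length-H sequence.
    out = np.stack([self_attention(rows[:, j]) for j in range(W)], axis=1)
    return out

feat = np.random.default_rng(0).standard_normal((16, 24, 8))
out = twin_attention(feat)
print(out.shape)  # (16, 24, 8): shape is preserved
```

After the two passes, every output position has (indirectly) aggregated context from its entire row and column, which is how long-range dependencies are captured at reduced cost.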