Recently, synthetic data-based instance segmentation has become an attractive paradigm, since simulation rendering and physics engines can generate high-quality image-annotation pairs at scale. In this paper, we propose a Parallel Pre-trained Transformers (PPT) framework for the synthetic data-based instance segmentation task. Specifically, we leverage off-the-shelf pre-trained vision Transformers to narrow the gap between natural and synthetic data, which provides good generalization to the downstream synthetic-data scene with few samples. A Swin-B-based CBNet V2, a Swin-L-based CBNet V2, and a Swin-L-based UniFormer are employed for parallel feature learning, and the predictions of the three models are fused by a pixel-level Non-maximum Suppression (NMS) algorithm to obtain more robust results. Experimental results show that PPT ranks first in the CVPR 2022 AVA Accessibility Vision and Autonomy Challenge with 65.155% mAP.
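The abstract names pixel-level NMS as the fusion step but does not spell out the procedure. A minimal sketch of one plausible reading, greedy suppression by mask IoU over the pooled predictions of the three models, is given below; the function name `pixel_level_nms`, the IoU threshold, and the greedy keep-loop are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pixel_level_nms(masks, scores, iou_thresh=0.5):
    """Fuse instance masks from several models via pixel-level NMS.

    masks  : list of HxW boolean arrays pooled from all models.
    scores : list of per-mask confidence scores.
    Returns indices of the masks kept after suppression.
    """
    order = np.argsort(scores)[::-1]  # process highest-confidence masks first
    keep = []
    for i in order:
        suppressed = False
        for j in keep:
            # IoU computed directly on pixels rather than on boxes
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            if union > 0 and inter / union > iou_thresh:
                suppressed = True  # overlaps a kept mask too much; drop it
                break
        if not suppressed:
            keep.append(i)
    return keep

# Hypothetical usage: pool the outputs of the three parallel models,
# then keep the non-redundant instances.
masks = cbnet_swin_b_masks + cbnet_swin_l_masks + uniformer_masks
scores = cbnet_swin_b_scores + cbnet_swin_l_scores + uniformer_scores
kept = pixel_level_nms(masks, scores)
```

Operating on mask IoU rather than box IoU lets overlapping instances with distinct shapes survive fusion, which is presumably why a pixel-level variant is preferred here.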