In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch that supports all image segmentation tasks (instance, panoptic, and semantic). It takes the dot product of DINO's query embeddings with a high-resolution pixel embedding map to predict a set of binary masks. Some key components of DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both with a ResNet-50 backbone and with a pre-trained SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
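The mask prediction step described above, a dot product between query embeddings and a pixel embedding map, can be illustrated with a minimal NumPy sketch. This is not the released Mask DINO code: the function name `predict_masks`, the tensor shapes, the sigmoid, and the 0.5 threshold are assumptions chosen for illustration, and random arrays stand in for the real decoder and backbone outputs.

```python
import numpy as np

def predict_masks(query_embed, pixel_embed, threshold=0.5):
    """Illustrative sketch (hypothetical helper, not the official API).

    query_embed: (N, C) array, one embedding per query.
    pixel_embed: (C, H, W) array, the high-resolution pixel embedding map.
    Returns a boolean array of shape (N, H, W), one binary mask per query.
    """
    # Dot product of every query with the embedding at every pixel.
    logits = np.einsum("nc,chw->nhw", query_embed, pixel_embed)
    # Assumed post-processing: sigmoid followed by a fixed threshold.
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs > threshold

# Random stand-ins for the decoder queries and the pixel embedding map.
rng = np.random.default_rng(0)
queries = rng.standard_normal((3, 8))      # N=3 queries, C=8 channels
pixels = rng.standard_normal((8, 16, 16))  # C=8, H=W=16
masks = predict_masks(queries, pixels)
print(masks.shape, masks.dtype)
```

Each query yields one mask at the resolution of the pixel embedding map, so the number of predicted masks equals the number of queries regardless of image size.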