In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes rather than a pre-defined, closed set of categories. The main contributions are as follows. First, we propose a transformer-based model for OVS, termed OVSegmentor, which exploits only web-crawled image-text pairs for pre-training, without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention-based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M for frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets: PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model outperforms the state-of-the-art method while using only 3\% of the data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research.
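To make the grouping-and-alignment idea concrete, the following is a minimal NumPy sketch, not the OVSegmentor implementation. It assumes pixel features, initial group tokens, and class-name text embeddings are already given as arrays in a shared space. It omits the learned projections and recurrent update of full slot attention, and the function names (`group_pixels`, `label_groups`, `segment`) are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def assign(pixels, groups):
    # Soft assignment of each pixel to the groups, shape (N, K).
    # The softmax runs over groups, so groups compete for every pixel.
    d = pixels.shape[1]
    return softmax(pixels @ groups.T / np.sqrt(d), axis=1)

def group_pixels(pixels, groups, iters=3, eps=1e-8):
    # Simplified slot-attention-style binding (assumption: no learned
    # projections, GRU, or MLP). pixels: (N, D), groups: (K, D).
    for _ in range(iters):
        attn = assign(pixels, groups)
        # Normalize per group so each token becomes a weighted mean of pixels.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        groups = weights.T @ pixels
    return groups, assign(pixels, groups)

def label_groups(groups, text_emb):
    # Give each group token the class whose text embedding is most
    # similar under cosine similarity. text_emb: (C, D).
    g = groups / np.linalg.norm(groups, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return (g @ t.T).argmax(axis=1)

def segment(pixels, init_groups, text_emb, iters=3):
    # Zero-shot labeling: each pixel inherits the class of its dominant group.
    groups, attn = group_pixels(pixels, init_groups, iters)
    return label_groups(groups, text_emb)[attn.argmax(axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pixels = rng.normal(size=(16, 8))   # stand-in for 16 pixel features
    init = rng.normal(size=(4, 8))      # stand-in for 4 group tokens
    text = rng.normal(size=(3, 8))      # stand-in for 3 class embeddings
    print(segment(pixels, init, text))  # one class index per pixel
```

In this sketch the per-pixel class map is obtained purely by similarity between group tokens and text embeddings, which is why no mask annotations are needed at inference. The two proxy tasks described above concern training and are not modeled here.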