Object proposal generation is a fundamental task in computer vision. In this paper, we propose ProposalCLIP, a method for unsupervised open-category object proposal generation. Previous works require a large number of bounding box annotations and/or can only generate proposals for a limited set of object categories. In contrast, ProposalCLIP predicts proposals for a large variety of object categories without annotations by exploiting CLIP (contrastive language-image pre-training) cues. First, we analyze CLIP for unsupervised open-category proposal generation and design an objectness score for proposal selection based on our empirical analysis. Second, we propose a graph-based merging module that addresses the limitations of CLIP cues and merges fragmented proposals. Finally, we present a proposal regression module that extracts pseudo labels based on CLIP cues and trains a lightweight network to further refine proposals. Extensive experiments on the PASCAL VOC, COCO and Visual Genome datasets show that ProposalCLIP generates better proposals than previous state-of-the-art methods. ProposalCLIP also benefits downstream tasks such as unsupervised object detection.
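To make the idea of graph-based merging of fragmented proposals concrete, the following is a minimal sketch and not the paper's actual module. The abstract does not say how the graph is built, so every detail here is an assumption: nodes are candidate boxes, an edge links two boxes that overlap and have similar feature vectors (standing in for CLIP image embeddings of the cropped proposals), and each connected component is replaced by its enclosing box. The names `merge_fragments`, `iou_thr` and `sim_thr` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def merge_fragments(
    boxes: List[Box],
    feats: List[Sequence[float]],  # assumed: one embedding per box
    iou_thr: float = 0.1,          # assumed threshold, not from the paper
    sim_thr: float = 0.9,          # assumed threshold, not from the paper
) -> List[Box]:
    """Merge connected components of the overlap/similarity graph."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge when two boxes overlap and look alike in feature space.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if iou(boxes[i], boxes[j]) >= iou_thr and cosine(feats[i], feats[j]) >= sim_thr:
                parent[find(i)] = find(j)

    # Replace every component by the smallest box enclosing its members.
    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    return [
        (min(b[0] for b in g), min(b[1] for b in g),
         max(b[2] for b in g), max(b[3] for b in g))
        for g in groups.values()
    ]
```

In this sketch, two overlapping boxes with near-identical features collapse into one enclosing box, while a distant or dissimilar box is left unchanged. The pairwise loop is quadratic in the number of proposals, which is acceptable for illustration but would likely need pruning for large candidate sets.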