We tackle the challenging task of unsupervised object localization. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. We present Multiple Object localization with Self-supervised Transformers (MOST), which uses features of transformers trained with self-supervised learning to localize multiple objects in real-world images. MOST analyzes the similarity maps of the features using box counting, a fractal analysis tool, to identify tokens lying on foreground patches. The identified tokens are then clustered, and the tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art (SOTA) object localization methods, MOST can localize multiple objects per image, and it outperforms SOTA algorithms on several object localization and discovery benchmarks on the PASCAL-VOC 07, PASCAL-VOC 12, and COCO20k datasets. Additionally, we show that MOST can be used for self-supervised pre-training of object detectors, and that it yields consistent improvements on fully supervised and semi-supervised object detection and on unsupervised region proposal generation.
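To make the box-counting step more concrete, the following is a minimal sketch of generic box counting on a binarized 2D map. It is not the MOST implementation: the abstract does not specify how similarity maps are thresholded or how the counts are turned into a foreground decision, so the function names (`box_count`, `box_counting_dimension`), the choice of box sizes, and the assumption that the map has already been converted to a boolean mask are all illustrative.

```python
import numpy as np


def box_count(mask, box_size):
    """Count boxes of side `box_size` that contain at least one True cell.

    `mask` is assumed to be a 2D boolean array, e.g. a thresholded
    similarity map (the thresholding rule is an assumption, not from the paper).
    """
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    # Pad with False so both sides are multiples of box_size.
    pad_h = (-h) % box_size
    pad_w = (-w) % box_size
    padded = np.pad(mask, ((0, pad_h), (0, pad_w)))
    H, W = padded.shape
    # Split the grid into non-overlapping box_size x box_size blocks.
    blocks = padded.reshape(H // box_size, box_size, W // box_size, box_size)
    occupied = blocks.any(axis=(1, 3))
    return int(occupied.sum())


def box_counting_dimension(mask, box_sizes=(1, 2, 4, 8)):
    """Estimate the box-counting dimension as the slope of
    log N(s) against log(1/s), where N(s) is the number of occupied boxes.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        # An empty map has no occupied boxes; report 0.0 by convention.
        return 0.0
    sizes = np.array(box_sizes, dtype=float)
    counts = np.array([box_count(mask, s) for s in box_sizes], dtype=float)
    slope, _ = np.polyfit(np.log(1.0 / sizes), np.log(counts), 1)
    return float(slope)
```

Under these assumptions, a map whose active cells fill a compact region gives a slope near 2, a thin line gives a slope near 1, and an isolated point gives a slope near 0. A statistic of this kind could, in principle, separate spatially coherent similarity maps from scattered ones, but how MOST uses it to select foreground tokens is described in the paper rather than in this abstract.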