The goal of self-supervised visual representation learning is to learn strong, transferable image representations, with the majority of research focusing on the object or scene level. In contrast, representation learning at the part level has received significantly less attention. In this paper, we propose an unsupervised approach to object part discovery and segmentation and make three contributions. First, we construct a proxy task through a set of objectives that encourages the model to learn a meaningful decomposition of the image into its parts. Second, prior work argues for reconstructing or clustering pre-computed features as a proxy for parts; we show empirically that this alone is unlikely to find meaningful parts, mainly because of the features' low resolution and the tendency of classification networks to spatially smear out information. We suggest that image reconstruction at the level of pixels can alleviate this problem, acting as a complementary cue. Third, we show that the standard evaluation based on keypoint regression does not correlate well with segmentation quality, and we therefore introduce two different metrics, normalized mutual information (NMI) and adjusted Rand index (ARI), that better characterize the decomposition of objects into parts. Our method yields semantic parts that are consistent across fine-grained but visually distinct categories, outperforming the state of the art on three benchmark datasets. Code is available at the project page: https://www.robots.ox.ac.uk/~vgg/research/unsup-parts/.
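To make the two proposed metrics concrete, the following is a minimal sketch of how NMI and ARI could be computed between a predicted part labelling and a ground-truth one, each given as a flat list of per-pixel part ids. It is not taken from the released code: the function names `part_ari` and `part_nmi` are hypothetical, the arithmetic-mean normalization of NMI is an assumption, and the choice of which pixels to include (for example, foreground only) is left to the caller.

```python
from collections import Counter
from math import comb, log


def _contingency(pred, gt):
    """Count co-occurrences of (predicted part, ground-truth part) labels."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of pixels")
    return Counter(zip(pred, gt)), Counter(pred), Counter(gt)


def part_ari(pred, gt):
    """Adjusted Rand index between two flat label sequences (hypothetical helper)."""
    joint, rows, cols = _contingency(pred, gt)
    n = len(pred)
    index = sum(comb(c, 2) for c in joint.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    total_pairs = comb(n, 2)
    expected = sum_rows * sum_cols / total_pairs if total_pairs else 0.0
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Both labellings are the same trivial partition.
        return 1.0
    return (index - expected) / (max_index - expected)


def part_nmi(pred, gt):
    """Normalized mutual information (hypothetical helper).

    Assumption: mutual information is normalized by the arithmetic mean
    of the two label entropies; other normalizations exist.
    """
    joint, rows, cols = _contingency(pred, gt)
    n = len(pred)
    mi = sum(
        (c / n) * log(c * n / (rows[u] * cols[v]))
        for (u, v), c in joint.items()
    )
    h_pred = -sum((c / n) * log(c / n) for c in rows.values())
    h_gt = -sum((c / n) * log(c / n) for c in cols.values())
    denom = 0.5 * (h_pred + h_gt)
    if denom == 0.0:
        # Both labellings consist of a single part.
        return 1.0
    return mi / denom


if __name__ == "__main__":
    # Toy flattened masks: the part ids are permuted but the partition is identical.
    predicted = [0, 0, 1, 1, 2, 2]
    annotated = [2, 2, 0, 0, 1, 1]
    print(part_ari(predicted, annotated), part_nmi(predicted, annotated))
```

Both scores are invariant to how part ids are numbered, which is why they suit unsupervised part discovery: the model's part 0 does not need to coincide with the annotator's part 0 for the segmentation to score well.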