We present Neural Congealing -- a zero-shot self-supervised framework for detecting and jointly aligning semantically-common content across a given set of images. Our approach harnesses the power of pre-trained DINO-ViT features to learn: (i) a joint semantic atlas -- a 2D grid that captures the mode of DINO-ViT features in the input set, and (ii) dense mappings from the unified atlas to each of the input images. We derive a new robust self-supervised framework that optimizes the atlas representation and mappings per image set, requiring only a few real-world images as input without any additional input information (e.g., segmentation masks). Notably, we design our losses and training paradigm to account only for the shared content under severe variations in appearance, pose, background clutter or other distracting objects. We demonstrate results on a plethora of challenging image sets including sets of mixed domains (e.g., aligning images depicting sculpture and artwork of cats), sets depicting related yet different object categories (e.g., dogs and tigers), or domains for which large-scale training data is scarce (e.g., coffee mugs). We thoroughly evaluate our method and show that our test-time optimization approach performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
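To make the two learned components concrete, below is a minimal sketch of the per-set test-time optimization, assuming PyTorch. The feature extractor is a placeholder (the real method uses frozen DINO-ViT features), and a per-image affine warp stands in for the paper's dense atlas-to-image mappings; the robustness losses and regularizers that handle background clutter and distractors are omitted here.

```python
# Illustrative sketch only -- not the authors' implementation.
# Assumptions: `extract_features` is a stub for a frozen DINO-ViT extractor,
# and each dense mapping is simplified to a 2x3 affine transform.
import torch
import torch.nn.functional as F

N, C, H, W = 8, 384, 32, 32  # images in the set, feature channels, feature grid size

def extract_features():
    # Placeholder: in practice, frozen DINO-ViT features per input image,
    # resampled to an H x W grid. Kept fixed during optimization.
    return torch.randn(N, C, H, W)

feats = extract_features()  # (N, C, H, W)

# (i) joint semantic atlas: a learnable 2D grid of features shared by the set
atlas = torch.zeros(1, C, H, W, requires_grad=True)

# (ii) per-image mapping parameters: one affine per image (the paper
# optimizes dense, non-rigid mappings), initialized to the identity
theta = torch.eye(2, 3).repeat(N, 1, 1).clone().requires_grad_(True)

opt = torch.optim.Adam([atlas, theta], lr=1e-2)
for step in range(200):
    # grid maps each atlas coordinate to a sampling location in the image,
    # i.e., a mapping from the unified atlas to each input image
    grid = F.affine_grid(theta, feats.shape, align_corners=False)
    warped = F.grid_sample(feats, grid, align_corners=False)  # image feats -> atlas frame
    # pull the shared content toward a common mode in the atlas
    # (the paper adds robustness terms so non-shared content is ignored)
    loss = F.mse_loss(warped, atlas.expand(N, -1, -1, -1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After optimization, the atlas captures the dominant shared semantic features, and each learned mapping aligns its image to that atlas, so correspondences between any two images in the set can be read off by composing one mapping with the inverse of another.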