We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category; neither assumption holds for the long tail of objects in the world. We present a self-supervised technique that optimizes directly on a sparse collection of images of a particular object or object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) as noisy, sparse keypoint matches and make them dense and accurate by optimizing a neural network that jointly maps the image collection onto a learned canonical grid. Experiments on the CUB and SPair-71k benchmarks demonstrate that our method produces globally consistent, higher-quality correspondences across the image collection than existing self-supervised methods. Code and other material will be made available at \url{https://kampta.github.io/asic}.
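To make the first step of the abstract concrete, below is a minimal sketch (not the authors' code) of how noisy pseudo-correspondences can be obtained from mutual nearest neighbors of pre-trained ViT patch features; the DINO checkpoint, input resolution, and file names are illustrative assumptions. ASIC's contribution is what comes after such matches: densifying and globally aligning them via a learned canonical grid.

```python
# Sketch: sparse pseudo-matches from pairwise nearest neighbors of
# pre-trained ViT (DINO) patch features. Assumes torch, torchvision,
# and Pillow are installed; model choice and image paths are examples.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# DINO ViT-S/8, commonly used as a source of deep-feature correspondences
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

@torch.no_grad()
def patch_features(path):
    """L2-normalized per-patch descriptors from the last ViT block."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    tokens = model.get_intermediate_layers(img, n=1)[0]   # (1, 1+N, D)
    return F.normalize(tokens[0, 1:], dim=-1)             # drop CLS -> (N, D)

def mutual_nn_matches(fa, fb):
    """Indices (i, j) of patches that are each other's nearest neighbor."""
    sim = fa @ fb.T                                        # cosine similarity
    nn_ab = sim.argmax(dim=1)                              # patches in a -> b
    nn_ba = sim.argmax(dim=0)                              # patches in b -> a
    i = torch.arange(fa.shape[0], device=fa.device)
    keep = nn_ba[nn_ab] == i                               # cycle-consistent pairs
    return i[keep], nn_ab[keep]

# Usage: noisy, sparse matches between two images of the same category
# ("bird1.jpg" / "bird2.jpg" are hypothetical placeholders).
# ia, ib = mutual_nn_matches(patch_features("bird1.jpg"),
#                            patch_features("bird2.jpg"))
```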