Modern image retrieval methods typically rely on fine-tuning pre-trained encoders to extract image-level descriptors. However, the most widely used models are pre-trained on ImageNet-1K, which covers only a limited set of classes, so the pre-trained feature representation is not universal enough to generalize well to diverse open-world classes. In this paper, we first cluster the large-scale LAION400M dataset into one million pseudo classes based on the joint textual and visual features extracted by the CLIP model. Because label granularity is ambiguous, the automatically clustered dataset inevitably contains heavy inter-class conflict. To alleviate this conflict, we randomly select a subset of inter-class prototypes when constructing the margin-based softmax loss. To further enhance the low-dimensional feature representation, we randomly select a subset of feature dimensions when calculating the similarities between embeddings and class-wise prototypes. These two random partial selections operate on the class dimension and the feature dimension of the prototype matrix, respectively, making the classification conflict-robust and the feature embedding compact. Our method significantly outperforms state-of-the-art unsupervised and supervised image retrieval approaches on multiple benchmarks. The code and pre-trained models are released at https://github.com/deepglint/unicom to facilitate future research.
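To make the pseudo-class construction step concrete, the snippet below sketches the clustering stage. It is a minimal illustration only: it assumes CLIP image and text embeddings have already been extracted, and both the averaging of the two modalities and the use of faiss k-means are expository assumptions, not necessarily the paper's exact recipe.

```python
# Minimal sketch of pseudo-class construction (illustrative assumptions:
# modality fusion by averaging, k-means via faiss).
import numpy as np
import faiss

def cluster_pseudo_classes(image_feats: np.ndarray,
                           text_feats: np.ndarray,
                           num_classes: int = 1_000_000) -> np.ndarray:
    # L2-normalize each modality, then average to form a joint feature
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    joint = ((img + txt) / 2.0).astype(np.float32)

    # k-means over the joint features; each cluster id becomes a pseudo
    # class label (at LAION400M scale this step would run distributed)
    kmeans = faiss.Kmeans(joint.shape[1], num_classes, niter=20, verbose=True)
    kmeans.train(joint)
    _, labels = kmeans.index.search(joint, 1)
    return labels.ravel()  # (N,) pseudo-class id per image-text pair
```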
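The dual random partial selection can likewise be sketched in PyTorch. Everything below is a hedged illustration: the hyperparameters (class_ratio, dim_ratio, margin, scale) and the ArcFace-style additive angular margin are assumptions chosen for clarity, not the paper's reported settings.

```python
# Hedged PyTorch sketch of the dual random partial selection inside a
# margin-based softmax loss (hyperparameters are illustrative assumptions).
import torch
import torch.nn.functional as F

def dual_partial_margin_loss(embeddings: torch.Tensor,  # (B, D) features
                             labels: torch.Tensor,      # (B,) class ids
                             prototypes: torch.Tensor,  # (C, D) learnable
                             class_ratio: float = 0.1,  # negatives sampled
                             dim_ratio: float = 0.5,    # dims kept
                             margin: float = 0.3,
                             scale: float = 32.0) -> torch.Tensor:
    B, D = embeddings.shape
    C = prototypes.shape[0]

    # -- partial selection over the CLASS dimension: always keep the
    # positive classes of the batch, then sample a random subset of the
    # remaining (possibly conflicting) negative prototypes
    pos = labels.unique()                          # sorted unique positives
    perm = torch.randperm(C, device=labels.device)
    neg = perm[~torch.isin(perm, pos)][: int(class_ratio * C)]
    sub_protos = prototypes[torch.cat([pos, neg])]           # (C', D)
    sub_labels = torch.searchsorted(pos, labels)   # labels in sampled space

    # -- partial selection over the FEATURE dimension: compute similarity
    # on a random subset of dimensions so every sub-vector stays useful
    keep = torch.randperm(D, device=labels.device)[: int(dim_ratio * D)]
    e = F.normalize(embeddings[:, keep], dim=1)
    p = F.normalize(sub_protos[:, keep], dim=1)

    # -- margin-based softmax (ArcFace-style additive angular margin)
    cos = (e @ p.t()).clamp(-1 + 1e-7, 1 - 1e-7)             # (B, C')
    target = F.one_hot(sub_labels, cos.shape[1]).bool()
    logits = torch.where(target, torch.cos(torch.acos(cos) + margin), cos)
    return F.cross_entropy(logits * scale, sub_labels)
```

Note that in this sketch only the sampled prototype rows receive gradients in a given step, which is what makes the classification robust to inter-class conflict, while the random dimension subset forces discriminative power into every part of the embedding.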