Strong image search models can be learned for a specific domain, i.e., a set of labels, provided that some labeled images of that domain are available. A practical visual search model, however, should be versatile enough to solve multiple retrieval tasks simultaneously, even if those cover very different specialized domains. Additionally, it should be able to benefit from even unlabeled images from these various retrieval tasks. This is the more practical scenario that we consider in this paper. We address it with the proposed Grappa, an approach that starts from a strong pretrained model and adapts it to tackle multiple retrieval tasks concurrently, using only unlabeled images from the different task domains. We extend the pretrained model with multiple independently trained sets of adaptors that use pseudo-label sets of different sizes, effectively mimicking different pseudo-granularities. We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers that we guide by propagating pseudo-granularity attentions across neighbors in the feature space. Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised Grappa model improves the zero-shot performance of a state-of-the-art self-supervised learning model, and in some cases reaches or surpasses a task label-aware oracle that selects the most fitting pseudo-granularity per task.
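The fusion idea described above can be illustrated with a minimal sketch: several adaptor sets produce embeddings at different pseudo-granularities, and attention weights combine them into one unified representation. All names and shapes here are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of attention-based fusion over adaptor sets
# (names, dimensions, and the linear-adaptor form are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# K adaptor sets, each trained on a pseudo-label set of a different size;
# here each adaptor is a simple linear map over a backbone feature of dim D.
K, D = 3, 8
adaptors = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(K)]

def fuse(feature, attn_logits):
    """Combine the K adapted embeddings with per-granularity attention weights."""
    weights = softmax(attn_logits)                        # one weight per adaptor set
    adapted = np.stack([W @ feature for W in adaptors])   # shape (K, D)
    return weights @ adapted                              # weighted sum -> shape (D,)

feature = rng.standard_normal(D)
fused = fuse(feature, attn_logits=np.array([0.5, 1.0, -0.2]))
print(fused.shape)  # (8,)
```

In the paper's setting, the attention weights themselves are learned fusion layers, guided by propagating pseudo-granularity attentions across neighbors in the feature space rather than being fixed per image as in this toy example.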