In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, and for the first time tailor them to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already obtain a category-level ZS-SBIR system that surpasses all prior art by a large margin (24.8%) - a strong testimony to studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is, however, trickier and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains, in the region of 26.9% over the previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise for tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/
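The cross-category separation regulariser in design (i) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: it assumes paired sketch/photo feature vectors and Euclidean distance, and penalises the variance of the per-category mean sketch-photo distance so that separation stays uniform across categories. The function name and exact formulation are illustrative.

```python
import numpy as np

def separation_uniformity_loss(sketch_feats, photo_feats, labels):
    """Hypothetical regulariser sketch: penalise non-uniform
    sketch-photo separation across categories.

    sketch_feats, photo_feats: (N, D) paired feature arrays.
    labels: (N,) integer category labels.
    Returns the variance of per-category mean pair distances
    (zero when every category has the same mean separation).
    """
    # per-pair Euclidean distance between a sketch and its photo
    dists = np.linalg.norm(sketch_feats - photo_feats, axis=1)
    cats = np.unique(labels)
    # mean separation within each category
    per_cat = np.array([dists[labels == c].mean() for c in cats])
    # uniformity penalty: how much the means disagree across categories
    return per_cat.var()
```

In practice such a term would be added to the standalone triplet loss, which on its own does not constrain the relative sketch-photo separation to be consistent across categories.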
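The patch shuffling technique in design (ii) can likewise be sketched. The key idea, as described above, is that a sketch and its paired photo receive the *same* random patch permutation, so matching pairs remain structurally aligned while absolute layout cues are destroyed. This NumPy sketch works on single-channel images and is an illustration under our own assumptions (grid size, single permutation per pair), not the paper's code.

```python
import numpy as np

def shuffle_patches(sketch, photo, patch_size, rng=None):
    """Apply one shared random patch permutation to a sketch-photo pair.

    sketch, photo: (H, W) arrays with H and W divisible by patch_size.
    Returns the two shuffled images; because both share one permutation,
    instance-level structural correspondence is preserved.
    """
    if rng is None:
        rng = np.random.default_rng()
    h, w = sketch.shape
    ph, pw = h // patch_size, w // patch_size
    perm = rng.permutation(ph * pw)  # one permutation for both images

    def apply(img):
        # split into (ph*pw, patch_size, patch_size) patches
        patches = (img.reshape(ph, patch_size, pw, patch_size)
                      .transpose(0, 2, 1, 3)
                      .reshape(ph * pw, patch_size, patch_size))
        patches = patches[perm]  # shuffle patch order
        # reassemble the shuffled patches into an image
        return (patches.reshape(ph, pw, patch_size, patch_size)
                       .transpose(0, 2, 1, 3)
                       .reshape(h, w))

    return apply(sketch), apply(photo)
```

A model trained to match such shuffled pairs is pushed toward patch-level (structural) correspondences rather than global appearance alone.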