Common Deep Metric Learning (DML) datasets specify only one notion of similarity, e.g., two images in the Cars196 dataset are deemed similar if they show the same car model. We argue that, depending on the application, users of image retrieval systems have different and changing similarity notions that should be incorporated as easily as possible. Therefore, we present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting in which users control the properties that should be important for image representations without any training data, using only natural language. To this end, we propose InDiReCT (Image representations using Dimensionality Reduction on CLIP embedded Texts), a model for LanZ-DML on images that exclusively uses a few text prompts for training. InDiReCT utilizes CLIP as a fixed feature extractor for images and texts and transfers the variation in text prompt embeddings to the image embedding space. Extensive experiments on five datasets and a total of thirteen similarity notions show that, despite not seeing any images during training, InDiReCT performs better than strong baselines and approaches the performance of fully-supervised models. An analysis reveals that InDiReCT learns to focus on regions of the image that correlate with the desired similarity notion, making it a fast-to-train and easy-to-use method for creating custom embedding spaces using only natural language.
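The core transfer step can be sketched as follows. This is a minimal illustration under the assumption that the dimensionality reduction is a PCA-style projection fitted on the text prompt embeddings (the paper's exact objective may differ), with random vectors standing in for actual CLIP encoder outputs:

```python
import numpy as np

# Stand-ins for CLIP embeddings; in practice these would come from
# CLIP's frozen text and image encoders (hypothetical shapes).
rng = np.random.default_rng(0)
d = 512                                # CLIP embedding dimension
text_emb = rng.normal(size=(13, d))    # e.g. one prompt per similarity value
img_emb = rng.normal(size=(100, d))    # gallery images to embed

# L2-normalize, since CLIP embeddings are compared via cosine similarity.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)

# Find the directions along which the text prompts vary the most
# (PCA-style: top-k right singular vectors of the centered prompt matrix).
k = 8
centered = text_emb - text_emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projection = vt[:k]                    # (k, d), spans the "similarity notion"

# Transfer: project image embeddings into this low-dimensional subspace,
# so image distances emphasize only the properties the prompts describe.
custom_emb = img_emb @ projection.T    # (100, k) task-specific representations
```

Retrieval would then use cosine or Euclidean distances between rows of `custom_emb`; swapping in a different set of prompts yields a different embedding space without retraining on images.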