Vision-language models (VLMs) such as CLIP have shown promising performance on a variety of recognition tasks using the standard zero-shot classification procedure -- computing similarity between the query image and the embedded words for each category. By only using the category name, they neglect to make use of the rich context of additional information that language affords. The procedure gives no intermediate understanding of why a category is chosen, and furthermore provides no mechanism for adjusting the criteria used towards this decision. We present an alternative framework for classification with VLMs, which we call classification by description. We ask VLMs to check for descriptive features rather than broad categories: to find a tiger, look for its stripes; its claws; and more. By basing decisions on these descriptors, we can provide additional cues that encourage using the features we want to be used. In the process, we can get a clear idea of what features the model uses to construct its decision; it gains some level of inherent explainability. We query large language models (e.g., GPT-3) for these descriptors to obtain them in a scalable way. Extensive experiments show our framework has numerous advantages past interpretability. We show improvements in accuracy on ImageNet across distribution shifts; demonstrate the ability to adapt VLMs to recognize concepts unseen during training; and illustrate how descriptors can be edited to effectively mitigate bias compared to the baseline.
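To make the scoring procedure concrete, below is a minimal sketch of classification by description on top of CLIP. It assumes the `clip` package from the OpenAI CLIP repository is installed; the hand-written descriptor lists, the "query.jpg" path, and the exact prompt wording are illustrative stand-ins (the paper obtains descriptors by prompting GPT-3), not the authors' released implementation.

```python
# Sketch: score each category by the mean similarity between the image
# embedding and the embeddings of "category, which has <descriptor>" prompts.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical descriptors; in the paper these come from querying GPT-3.
descriptors = {
    "tiger": ["orange fur with black stripes", "sharp claws", "a long tail"],
    "house cat": ["a small body", "soft fur", "pointed ears", "whiskers"],
}

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)

    scores = {}
    for category, descs in descriptors.items():
        prompts = [f"{category}, which has {d}" for d in descs]
        text_feat = model.encode_text(clip.tokenize(prompts).to(device))
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        # Average over descriptors, so each feature contributes to the decision.
        scores[category] = (image_feat @ text_feat.T).mean().item()

print(max(scores, key=scores.get))
```

Because the per-descriptor similarities are computed explicitly, they can be inspected to see which features drove the prediction, and individual descriptors can be added or removed to adjust the decision criteria.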