Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLMs) trained on web-scale text show an impressive ability to repurpose their learned knowledge for a multitude of tasks. In this work, we present a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is prompted with a few example text descriptions written by different annotators and, conditioned on these examples, generates multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification from these class views. We show that each text view of a class provides complementary information, allowing the model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer consumes the multi-view text supervision from the LLM more effectively than baseline models. I2MVFormer establishes a new state of the art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
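The view-generation step can be pictured with a short sketch. The snippet below is a minimal illustration of few-shot prompting as described above, not the paper's actual prompt or pipeline: `EXAMPLES`, `build_prompt`, `generate_views`, and the `llm` callable are all hypothetical names, and any stochastic prompt-to-text function can stand in for `llm`.

```python
from typing import Callable

# In-context examples: (class name, annotator-written description) pairs.
# These classes and texts are illustrative placeholders, not data from the paper.
EXAMPLES = [
    ("cardinal", "A mid-sized songbird with a pointed crest and a bright red body."),
    ("blue jay", "A noisy bird with blue upperparts, a white face, and a crest."),
]

def build_prompt(target_class: str) -> str:
    """Concatenate the example descriptions, then ask for the target class."""
    lines = [f"Describe the {name}: {text}" for name, text in EXAMPLES]
    lines.append(f"Describe the {target_class}:")
    return "\n".join(lines)

def generate_views(llm: Callable[[str], str], target_class: str, k: int = 3) -> list[str]:
    """Sample k independent text descriptions (views) of one class.

    `llm` is any stochastic text-generation function (e.g., a wrapper
    around a hosted LLM API); repeated sampling yields distinct views.
    """
    prompt = build_prompt(target_class)
    return [llm(prompt) for _ in range(k)]
```

Because the LLM samples stochastically, repeated calls with the same prompt yield different descriptions; the premise above is that these views carry complementary information about the class.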
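On the consumption side, the following PyTorch sketch shows one simple way to fuse several per-view text embeddings into a single class embedding. It is an illustrative stand-in under assumed dimensions, not the actual I2MVFormer architecture: the module name, the single encoder layer, and mean pooling over views are all assumptions made for this sketch.

```python
import torch
import torch.nn as nn

class MultiViewClassEmbedder(nn.Module):
    """Fuse per-view text embeddings into one class embedding (illustrative only)."""

    def __init__(self, view_dim: int = 768, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=view_dim, nhead=num_heads, batch_first=True
        )
        self.fuse = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, view_embeddings: torch.Tensor) -> torch.Tensor:
        # view_embeddings: (num_classes, num_views, view_dim), one row per class.
        fused = self.fuse(view_embeddings)  # attention across a class's views
        return fused.mean(dim=1)            # (num_classes, view_dim)

# Usage: 10 classes, 3 LLM-generated views each, 768-dim text features
# (all placeholder sizes).
views = torch.randn(10, 3, 768)
class_embeddings = MultiViewClassEmbedder()(views)  # -> (10, 768)
```

The cross-view attention step reflects the abstract's claim that views provide complementary information: the fused embedding can weight details mentioned in one view but absent from another.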