Contrastively trained text-image models have the remarkable ability to perform zero-shot classification: classifying previously unseen images into categories the model was never explicitly trained to identify. However, these zero-shot classifiers require prompt engineering to achieve high accuracy, which typically means hand-crafting a set of prompts for each individual downstream task. In this work, we aim to automate prompt engineering and improve zero-shot accuracy through prompt ensembling. In particular, we ask: given a large pool of prompts, can we automatically score the prompts and ensemble those that are most suitable for a particular downstream dataset, without access to labeled validation data? We demonstrate that this is possible. In doing so, we identify several pathologies in a naive prompt scoring method, where the score can easily become overconfident due to biases in the pre-training and test data, and we propose a novel prompt scoring method that corrects for these biases. Using the proposed scores to form a weighted-average prompt ensemble, our method outperforms both the equal-average ensemble and hand-crafted prompts on ImageNet, four of its variants, and 11 fine-grained classification benchmarks, all while being fully automatic, optimization-free, and requiring no labeled validation data.
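The weighted-average prompt ensemble described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the per-prompt, per-class text embeddings and the prompt weights below are toy stand-ins (in practice the embeddings would come from a text-image model such as CLIP, and the weights from an automatic prompt-scoring method).

```python
import numpy as np

def weighted_prompt_ensemble(prompt_embeds, weights):
    """Combine per-prompt class embeddings into one zero-shot classifier.

    prompt_embeds: array of shape (P, C, D) -- P prompts, C classes,
        D embedding dimensions (e.g. text embeddings of
        "a photo of a {class}" for each prompt template).
    weights: length-P non-negative prompt scores; normalized to sum to 1,
        so equal weights recover the plain equal-average ensemble.
    Returns: (C, D) L2-normalized ensemble class embeddings.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    ens = np.tensordot(w, np.asarray(prompt_embeds, dtype=float), axes=1)
    return ens / np.linalg.norm(ens, axis=-1, keepdims=True)

def zero_shot_classify(image_embeds, class_embeds):
    """Predict classes by cosine similarity (embeddings assumed normalized).

    image_embeds: (N, D) image features; class_embeds: (C, D).
    Returns: (N,) predicted class indices.
    """
    return (image_embeds @ class_embeds.T).argmax(axis=-1)

# Toy example: 2 prompts, 2 classes, 2 dims. Prompt 0 is informative,
# prompt 1 is a "bad" prompt that confuses the classes; a high weight
# on prompt 0 keeps the ensemble accurate.
prompt_embeds = np.array([[[1.0, 0.0], [0.0, 1.0]],
                          [[0.0, 1.0], [1.0, 0.0]]])
class_embeds = weighted_prompt_ensemble(prompt_embeds, [0.9, 0.1])
images = np.array([[1.0, 0.0], [0.0, 1.0]])
preds = zero_shot_classify(images, class_embeds)
```

With these toy inputs, `preds` is `[0, 1]`; with the weights reversed (upweighting the bad prompt), the same images would be misclassified, which is why scoring the prompts matters.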