This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt, which transfers strongly across tasks, can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 downstream datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% over CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 over ZSSeg).