Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies use only a single prompt for tuning, neglecting the step-by-step cognitive reasoning process that humans perform in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation of the human reasoning process and has proven useful for natural language processing (NLP) tasks. Based on this cognitive intuition, we believe that conducting effective reasoning is also an important problem in visual tasks, and that a chain of thought could be a solution to this problem. In this work, we propose a novel chain-of-thought prompt tuning method for vision-language modeling. Extensive experiments show that our method not only generalizes better in image classification tasks, transfers better beyond a single dataset, and achieves stronger domain generalization, but also performs much better in image-text retrieval and visual question answering, which require more reasoning capabilities. We are the first to successfully adapt chain-of-thought prompting in a way that combines visual and textual embeddings. We will release our code.