Large vision-language models (VLMs), such as CLIP, learn rich joint image-text representations, facilitating advances in numerous downstream tasks, including zero-shot classification and text-to-image generation. Nevertheless, existing VLMs exhibit a prominent, well-documented limitation: they fail to encapsulate compositional concepts such as counting. We introduce a simple yet effective method to improve the quantitative understanding of VLMs, while maintaining their overall performance on common benchmarks. Specifically, we propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective. Our counting loss is deployed over automatically-created counterfactual examples, each consisting of an image and a caption containing an incorrect object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant, which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench" - a new image-text counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.
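To make the counting-contrastive objective concrete, the following is a minimal PyTorch-style sketch, assuming precomputed CLIP image and text embeddings. The function name, the two-way softmax formulation, and the temperature value are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def counting_contrastive_loss(image_emb, true_text_emb, cf_text_emb, temperature=0.07):
    """Hypothetical sketch of a counting-contrastive loss with hard negatives.

    image_emb:     (B, D) image embeddings from the vision encoder
    true_text_emb: (B, D) embeddings of the correct captions
    cf_text_emb:   (B, D) embeddings of counterfactual captions
                   (same caption, wrong object count), used as hard negatives
    """
    # Normalize embeddings so dot products are cosine similarities, as in CLIP.
    image_emb = F.normalize(image_emb, dim=-1)
    true_text_emb = F.normalize(true_text_emb, dim=-1)
    cf_text_emb = F.normalize(cf_text_emb, dim=-1)

    # Similarity of each image to its correct caption and to its counterfactual caption.
    pos_sim = (image_emb * true_text_emb).sum(dim=-1) / temperature  # (B,)
    neg_sim = (image_emb * cf_text_emb).sum(dim=-1) / temperature    # (B,)

    # Two-way softmax per image: the correct caption should score higher
    # than the counterfactual one (class 0 is the positive).
    logits = torch.stack([pos_sim, neg_sim], dim=-1)                 # (B, 2)
    targets = torch.zeros(image_emb.size(0), dtype=torch.long, device=image_emb.device)
    return F.cross_entropy(logits, targets)
```

During finetuning, a term of this form would be added to CLIP's original image-text contrastive loss with some weighting hyperparameter, so that count discrimination is learned without sacrificing the model's general zero-shot behavior.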