Vision-language models trained on large, randomly collected data have had a significant impact in many areas since they appeared. But while they show strong performance in various fields, such as image-text retrieval, their inner workings are still not fully understood. The present work analyses the true zero-shot capabilities of such models. We start with an analysis of the training corpus, assessing to what extent (and which of) the test classes are truly zero-shot, and how this correlates with per-class performance. We follow up with an analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical notion of zero-shot learning emerges from large-scale webly supervision. We leverage the recently released LAION400M data corpus as well as the publicly available pretrained models of CLIP, OpenCLIP, and FLAVA, evaluating attribute-based zero-shot capabilities on the CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popular zero-shot benchmarks are observed (a lot) during pre-training; (ii) zero-shot performance mainly stems from the models' capability to recognize class labels whenever they are present in the text, and a significantly lower-performing capability for attribute-based zero-shot learning is only observed when class labels are not used; (iii) the number of attributes used can have a significant effect on performance, and can easily cause a significant performance drop.
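To make the two evaluation settings concrete, below is a minimal sketch (not the paper's evaluation code) contrasting class-label prompts with attribute-only prompts for zero-shot classification, using the public OpenCLIP API and its LAION400M-pretrained weights mentioned above. The class names and attribute descriptions are illustrative placeholders, not the CUB/AWA2 attribute vocabulary.

```python
import torch
import open_clip

# OpenCLIP ViT-B/32 weights pretrained on LAION400M
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# (a) class-label prompts: the class name itself appears in the text
label_prompts = [f"a photo of a {c}" for c in ["zebra", "horse"]]

# (b) attribute-only prompts: each class is described purely by attributes
#     (hypothetical attribute phrases, for illustration only)
attribute_prompts = [
    "a photo of an animal with black and white stripes and hooves",
    "a photo of an animal with a solid coat, a mane, and hooves",
]

def classify(image, prompts):
    """Score one preprocessed PIL image against a list of text prompts
    via cosine similarity of the normalized embeddings."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer(prompts))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (100.0 * img @ txt.T).softmax(dim=-1)
```

Comparing `classify(image, label_prompts)` against `classify(image, attribute_prompts)` on the same images reflects the gap discussed in finding (ii): accuracy is typically much higher when the class label appears in the prompt than when the class must be inferred from attributes alone.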