Vision models trained on multimodal datasets can benefit from the wide availability of large image-caption datasets. A recent model (CLIP) was found to generalize well in zero-shot and transfer learning settings. This could imply that linguistic or "semantic grounding" confers additional generalization abilities to the visual feature space. Here, we systematically evaluate various multimodal architectures and vision-only models in terms of unsupervised clustering, few-shot learning, transfer learning and adversarial robustness. In each setting, multimodal training produced no additional generalization capability compared to standard supervised visual training. We conclude that work is still required for semantic grounding to help improve vision models.