Vision language (VL) models like CLIP are robust to natural distribution shifts, in part because CLIP learns on unstructured data using a technique called caption supervision; the model interprets image-linked texts as ground-truth labels. In a carefully controlled comparison study, we show that caption-supervised CNNs trained with a standard cross-entropy loss (with image labels assigned by scanning captions for class names) can exhibit greater distributional robustness than VL models trained on the same data. To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet (https://github.com/penfever/CaptionNet/), which includes a class-balanced, fully supervised dataset of over 50,000 new human-labeled, ImageNet-compliant samples paired with web-scraped captions. In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration, and supervision strategy enables robust computer vision. We also provide the codebase necessary to reproduce our experiments at VL Hub (https://github.com/penfever/vlhub/).
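The label-assignment step described above (scanning each web caption for class names to produce cross-entropy training labels) can be sketched as follows; the function and variable names are illustrative, not taken from the paper's codebase:

```python
def labels_from_caption(caption, class_names):
    """Return indices of classes whose name appears in the caption text.

    A minimal, case-insensitive substring match; real pipelines would
    also handle synonyms, plurals, and tokenization.
    """
    text = caption.lower()
    return [i for i, name in enumerate(class_names) if name.lower() in text]

# Example: a caption mentioning two of three known classes.
classes = ["golden retriever", "tabby cat", "pickup truck"]
print(labels_from_caption("A golden retriever plays fetch near a pickup truck", classes))
# → [0, 2]
```

Captions matching multiple class names, or none, are a key source of label noise in this setup, which is one reason the paper compares data filtration strategies.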