Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternately selects relevant training data from the pool by measuring the similarity between their text embeddings and the embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially when the raw data size is large.
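To make the outer curation loop concrete, below is a minimal sketch of the similarity-based selection step described above. It assumes a generic `text_encoder` callable that maps a list of strings to an (N, D) tensor of embeddings, plus an illustrative threshold `tau`; the names, threshold value, and batching scheme are our assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def curate(pool_texts, metadata, text_encoder, tau=0.5, batch_size=1024):
    """Return indices of pool captions whose embedding is close to any
    metadata (e.g., class-name) embedding under cosine similarity.

    pool_texts:   list[str], captions of the raw image-text pool
    metadata:     list[str], task metadata such as class names
    text_encoder: callable, list[str] -> (N, D) float tensor (assumed)
    tau:          similarity threshold (illustrative value)
    """
    with torch.no_grad():
        # Embed the task metadata once; L2-normalize so the dot
        # product below is cosine similarity.
        meta_emb = F.normalize(text_encoder(metadata), dim=-1)      # (M, D)
        keep = []
        for start in range(0, len(pool_texts), batch_size):
            batch = pool_texts[start:start + batch_size]
            txt_emb = F.normalize(text_encoder(batch), dim=-1)      # (B, D)
            # Max similarity of each caption to any metadata entry.
            sim = (txt_emb @ meta_emb.T).max(dim=-1).values         # (B,)
            keep.extend(start + i for i in
                        torch.nonzero(sim > tau).flatten().tolist())
    return keep
```

The inner loop would then train contrastively on the selected image-text pairs, and because the text encoder improves during that training, reusing it in the next call to `curate` progressively refines the curated data, which is how the text encoder connects the two loops.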