While large-scale pre-training has made great progress in bridging the gap between vision and language, it still faces several challenges. First, pre-training is computationally expensive. Second, there is no efficient way to handle the noisy data that degrades model performance. Third, previous methods only leverage limited image-text paired data while ignoring richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose EfficientCLIP, which uses Ensemble Confident Learning to obtain a less noisy subset of the paired data. Extra rich non-paired single-modal text data is used to boost the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources of CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
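To make the noise-filtering idea concrete, the following is a minimal sketch, not the paper's exact procedure: it assumes an ensemble of independently trained scoring models, each assigning a similarity score to an (image, text) pair, and keeps only pairs on which the ensemble agrees with high confidence. The scorer interface and the threshold value are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's exact algorithm) of
# ensemble-based confidence filtering for noisy image-text pairs.
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[object, str]                 # (image, caption) placeholder types
Scorer = Callable[[object, str], float]   # model: pair -> similarity in [0, 1]

def filter_by_ensemble_confidence(
    pairs: Sequence[Pair],
    scorers: List[Scorer],
    threshold: float = 0.7,               # assumed cutoff, tuned in practice
) -> List[Pair]:
    """Keep pairs whose mean ensemble similarity exceeds `threshold`."""
    clean_subset = []
    for image, text in pairs:
        scores = [score(image, text) for score in scorers]
        mean_score = sum(scores) / len(scores)
        if mean_score >= threshold:        # high agreement -> likely a correct pair
            clean_subset.append((image, text))
    return clean_subset
```

The cleaner subset returned by such a filter would then serve as the training data for the cross-modal objective, while the discarded pairs and extra non-paired text could still be used for single-modal text training.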