Vision and Language Pretraining has become the prevalent approach for tackling multimodal downstream tasks. The current trend is to move towards ever larger models and pretraining datasets. In the long run, this computational race does not seem to be a sustainable path, and it de facto excludes academic laboratories with limited resources. In this work, we propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning by: (a) a new hierarchical cross-modal alignment loss, (b) a new self-supervised scheme based on masked image modeling, and (c) leveraging image-level annotations, called Visual Concepts, obtained with existing foundation models such as CLIP, to boost the performance of the image encoder. Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding. The code will be made publicly available here: https://github.com/mshukor/ViCHA
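As a rough illustration of point (c), the sketch below shows one way image-level Visual Concepts could be extracted with an off-the-shelf CLIP model: score a candidate concept vocabulary against an image and keep the top-k matches. The checkpoint, vocabulary, and `top_k` value are illustrative assumptions, not the exact pipeline used in ViCHA.

```python
# Minimal sketch (not the authors' pipeline): rank candidate concept labels
# against an image with CLIP and keep the best-matching ones as "Visual Concepts".
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_visual_concepts(image: Image.Image, vocabulary: list[str], top_k: int = 5):
    """Return the top-k concepts whose CLIP text embeddings best match the image."""
    inputs = processor(text=vocabulary, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, len(vocabulary)): image-text similarity scores.
    scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
    top = scores.topk(min(top_k, len(vocabulary)))
    return [(vocabulary[i], float(scores[i])) for i in top.indices]

# Hypothetical usage with an illustrative concept vocabulary:
# image = Image.open("example.jpg")
# print(extract_visual_concepts(image, ["dog", "cat", "beach", "skyscraper"]))
```

The returned concept strings could then be fed to the image encoder as additional image-level annotations; how they are injected is specific to the ViCHA architecture and not shown here.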