State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with early fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
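To make the architectural dichotomy concrete, below is a minimal PyTorch sketch of the two model families the abstract contrasts. The encoder arguments, embedding dimensions, and class names are illustrative assumptions, not FLAVA's actual implementation; FLAVA's premise is that a foundation model should support both kinds of capability at once rather than pick one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalContrastive(nn.Module):
    """Cross-modal (contrastive) family: each modality is encoded
    separately and the two embedding spaces are aligned (CLIP-style);
    image and text tokens are never processed jointly."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 768, proj_dim: int = 256):
        super().__init__()
        self.image_encoder = image_encoder  # assumed to emit (B, dim) pooled vectors
        self.text_encoder = text_encoder    # assumed to emit (B, dim) pooled vectors
        self.image_proj = nn.Linear(dim, proj_dim)
        self.text_proj = nn.Linear(dim, proj_dim)

    def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_proj(self.image_encoder(image)), dim=-1)
        txt = F.normalize(self.text_proj(self.text_encoder(text)), dim=-1)
        return img @ txt.t()  # (B, B) similarity logits for a contrastive loss


class MultiModalFusion(nn.Module):
    """Multi-modal (early fusion) family: unimodal token states are
    concatenated and processed jointly by a fusion transformer, so the
    output reflects image-text interactions in every fusion layer."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 768, num_layers: int = 4):
        super().__init__()
        self.image_encoder = image_encoder  # assumed to emit (B, N_img, dim) tokens
        self.text_encoder = text_encoder    # assumed to emit (B, N_txt, dim) tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([self.image_encoder(image), self.text_encoder(text)], dim=1)
        return self.fusion(joint)  # (B, N_img + N_txt, dim) joint representation
```

The contrastive design suits retrieval-style cross-modal tasks, while the fusion design suits multi-modal reasoning tasks such as VQA; a single universal model of the kind the abstract calls for would need to serve both.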