Vision and Language (VL) models have demonstrated remarkable zero-shot performance on a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision&Language Concepts (SVLC), which includes object attributes, relations, and states that are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities, both when training from scratch and when fine-tuning a pre-trained model.
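To make the text-manipulation idea concrete, here is a minimal sketch, assuming a simple rule-based attribute swap; the color list, function name, and example caption are illustrative assumptions rather than the paper's actual implementation. It produces a "hard negative" caption that differs from the original in a single object attribute, which could then be contrasted against the paired image during training.

```python
# Minimal sketch (not the authors' code) of rule-based text manipulation
# for teaching SVLC: swap one color attribute in a caption so the new
# caption no longer matches the paired image. The COLORS list and the
# function name make_attribute_negative are illustrative assumptions.
import random
from typing import Optional

COLORS = ["red", "blue", "green", "yellow", "black", "white", "brown"]

def make_attribute_negative(caption: str, rng: random.Random) -> Optional[str]:
    """Replace one color word in the caption with a different color,
    yielding an attribute-level hard negative; return None if the
    caption contains no manipulable attribute."""
    tokens = caption.split()
    color_positions = [i for i, t in enumerate(tokens) if t.lower() in COLORS]
    if not color_positions:
        return None  # nothing to manipulate in this caption
    i = rng.choice(color_positions)
    alternatives = [c for c in COLORS if c != tokens[i].lower()]
    tokens[i] = rng.choice(alternatives)
    return " ".join(tokens)

if __name__ == "__main__":
    rng = random.Random(0)
    caption = "a black dog catching a red frisbee"
    print(make_attribute_negative(caption, rng))
    # e.g., "a black dog catching a green frisbee"
```

Analogous rules could target relations or states, for instance by swapping prepositions or verbs identified with an off-the-shelf parser; the manipulations proposed in the paper are driven by language-structure understanding rather than a fixed word list.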