Numerous visio-linguistic (V+L) representation learning methods have been developed, yet existing datasets do not evaluate the extent to which they represent visual and linguistic concepts in a unified space. Inspired by the cross-lingual transfer and psycholinguistics literature, we propose a novel evaluation setting for V+L models: zero-shot cross-modal transfer. Existing V+L benchmarks also often report global accuracy scores on the entire dataset, rendering it difficult to pinpoint the specific reasoning tasks that models fail and succeed at. To address this issue and enable the evaluation of cross-modal transfer, we present TraVLR, a synthetic dataset comprising four V+L reasoning tasks. Each example encodes the scene bimodally such that either modality can be dropped during training/testing with no loss of relevant information. TraVLR's training and testing distributions are also constrained along task-relevant dimensions, enabling the evaluation of out-of-distribution generalisation. We evaluate four state-of-the-art V+L models and find that although they perform well on test examples of the same modality seen during training, all models fail to transfer cross-modally and have limited success accommodating the addition or deletion of one modality. In alignment with prior work, we also find these models to require large amounts of data to learn simple spatial relationships. We release TraVLR as an open challenge for the research community.
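To illustrate the zero-shot cross-modal transfer setting described above, the following minimal sketch (all names and the data schema are hypothetical and not part of the released TraVLR code) shows how a bimodally encoded example might be restricted to a single modality, e.g. training on the textual rendering of each scene and testing on the visual one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BimodalExample:
    """A scene encoded redundantly in both modalities (hypothetical schema)."""
    image: Optional[object]   # visual rendering of the scene
    caption: Optional[str]    # textual rendering of the same scene
    label: int                # task label, e.g. whether the statement holds

def drop_modality(example: BimodalExample, keep: str) -> BimodalExample:
    """Return a copy of the example with only one modality retained.

    Because each scene is fully described by either modality, dropping one
    loses no task-relevant information.
    """
    if keep == "image":
        return BimodalExample(image=example.image, caption=None, label=example.label)
    if keep == "text":
        return BimodalExample(image=None, caption=example.caption, label=example.label)
    raise ValueError(f"unknown modality: {keep!r}")

# Zero-shot cross-modal transfer: train on one modality, test on the other.
# train_set = [drop_modality(ex, keep="text") for ex in travlr_train]
# test_set  = [drop_modality(ex, keep="image") for ex in travlr_test]
```

The same helper also supports the same-modality and modality-addition/deletion conditions by varying which modality is kept at training versus testing time.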