Numerous visio-linguistic (V+L) representation learning methods have been developed, yet existing datasets do not adequately evaluate the extent to which they represent visual and linguistic concepts in a unified space. We propose several novel evaluation settings for V+L models, including cross-modal transfer. Furthermore, existing V+L benchmarks often report global accuracy scores on the entire dataset, making it difficult to pinpoint the specific reasoning tasks that models fail and succeed at. We present TraVLR, a synthetic dataset comprising four V+L reasoning tasks. TraVLR's synthetic nature allows us to constrain its training and testing distributions along task-relevant dimensions, enabling the evaluation of out-of-distribution generalisation. Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information. We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer and have limited success accommodating the addition or deletion of one modality. We release TraVLR as an open challenge for the research community.