For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question answering through four types of generalization tests: a novel segment-combine test for multi-image queries, contrast sets, compositional generalization, and cross-benchmark transfer. End-to-end trained vision-and-language systems exhibit sizeable performance drops across all of these tests. Neuro-symbolic methods suffer even more on cross-benchmark transfer from GQA to VQA, but they show smaller accuracy drops on the other generalization tests, and their performance improves quickly with few-shot training. Overall, our results demonstrate the complementary benefits of these two paradigms and emphasize the importance of using a diverse suite of generalization tests to fully characterize model robustness to distribution shift.