Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions. However, it remains unclear which types of compositional reasoning cause models to mispredict. Furthermore, it is difficult to discern whether models arrive at answers using compositional reasoning or by leveraging data biases. In this paper, we develop a question decomposition engine that programmatically deconstructs a compositional question into a directed acyclic graph of sub-questions. The graph is designed such that each parent question is a composition of its children. We present AGQA-Decomp, a benchmark containing $2.3M$ question graphs, with an average of $11.49$ sub-questions per graph, and $4.55M$ total new sub-questions. Using question graphs, we evaluate three state-of-the-art models with a suite of novel compositional consistency metrics. We find that models either cannot reason correctly through most compositions or are reliant on incorrect reasoning to reach answers, frequently contradicting themselves or achieving high accuracies when failing at intermediate reasoning steps.
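The structure above, in which each parent question is a composition of its child sub-questions, can be sketched as a small data structure. This is a minimal illustrative sketch, not the paper's actual decomposition engine or metrics; the names `SubQuestion` and `is_consistent`, and the example questions, are assumptions introduced here.

```python
# Illustrative sketch (not the authors' engine): a sub-question DAG in which
# a parent's answer should follow from the composition of its children's answers.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubQuestion:
    text: str                                   # natural-language question
    answer: str                                 # a model's predicted answer
    children: List["SubQuestion"] = field(default_factory=list)
    # Hypothetical composition rule: how the parent's answer should be
    # derived from its children's answers.
    compose: Optional[Callable[[List[str]], str]] = None


def is_consistent(node: SubQuestion) -> bool:
    """A node is consistent if its answer matches the composition of its
    children's answers, and every child sub-graph is consistent as well."""
    if not node.children:
        return True
    expected = node.compose([c.answer for c in node.children])
    return node.answer == expected and all(is_consistent(c) for c in node.children)


# Hypothetical example graph for a compositional question.
hold = SubQuestion("Did the person hold a phone?", "yes")
eat = SubQuestion("Did the person eat?", "yes")
root = SubQuestion(
    "Did the person hold a phone and eat?",
    "yes",
    children=[hold, eat],
    compose=lambda answers: "yes" if all(a == "yes" for a in answers) else "no",
)

print(is_consistent(root))  # a self-contradicting model would make this False
```

A consistency metric of this flavor flags a model that answers the parent correctly while contradicting itself on the children, which accuracy alone cannot detect.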