Humans are able to perceive, understand, and reason about physical events. Developing models with similar physical understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this goal, in this work we introduce CRAFT, a new visual question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video–question pairs generated from 10K videos across 20 different virtual environments, each containing various objects in motion that interact with one another and with the scene. Two of the question categories in CRAFT cover previously studied descriptive and counterfactual questions. In addition, inspired by theories of force dynamics in cognitive linguistics, we introduce new question categories that involve understanding object interactions through the notions of cause, enable, and prevent. Our results demonstrate that even though these tasks seem simple and intuitive for humans, the evaluated baseline models, including existing state-of-the-art methods, cannot yet cope with the challenges posed by our benchmark dataset.