We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based on 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one achieves an overall score of only 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capabilities.