We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based on 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one achieves an overall score of only 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future research on embodied AI with stronger situation understanding and reasoning capabilities.