Visual Question Answering (VQA) has witnessed tremendous progress in recent years. However, most efforts focus only on 2D image question answering tasks. In this paper, we present the first attempt at extending VQA to the 3D domain, which can facilitate artificial intelligence's perception of real-world 3D scenarios. Unlike image-based VQA, 3D Question Answering (3DQA) takes colored point clouds as input and requires comprehension of both appearance and 3D geometry to answer 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework \textbf{``3DQA-TR''}, which consists of two encoders that exploit appearance and geometry information, respectively. The multi-modal information of appearance, geometry, and the linguistic question can then attend to each other via a 3D-Linguistic BERT to predict the target answers. To verify the effectiveness of the proposed 3DQA framework, we further develop the first 3DQA dataset \textbf{``ScanQA''}, which builds on the ScanNet dataset and contains $\sim$6K questions and $\sim$30K answers for $806$ scenes. Extensive experiments on this dataset demonstrate the clear superiority of our proposed 3DQA framework over existing VQA frameworks, as well as the effectiveness of our major designs. Our code and dataset will be made publicly available to facilitate research in this direction.
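The joint-attention idea in the abstract, where appearance tokens, geometry tokens, and question tokens all attend to each other before answer prediction, can be sketched at a high level. The following is a minimal, illustrative NumPy sketch, not the authors' implementation: all shapes, token counts, and the single-head attention layer are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (assumed for illustration)

# Assumed token embeddings from the three modalities
appearance = rng.normal(size=(8, d))   # e.g. per-object appearance features
geometry   = rng.normal(size=(8, d))   # e.g. per-object 3D geometry features
question   = rng.normal(size=(6, d))   # e.g. word embeddings of the question

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over the joint token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))  # every token attends to every other
    return attn @ v

# Joint sequence: appearance, geometry, and question tokens fused together
tokens = np.concatenate([appearance, geometry, question], axis=0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
fused = self_attention(tokens, Wq, Wk, Wv)

# Pool and classify into an (assumed) fixed answer vocabulary
n_answers = 10
W_cls = rng.normal(size=(d, n_answers)) * 0.1
logits = fused.mean(axis=0) @ W_cls
answer_id = int(np.argmax(logits))
print(tokens.shape, fused.shape, answer_id)
```

In the actual framework, the appearance and geometry encoders would produce these token embeddings from the colored point cloud, and the fusion layer would be a full 3D-Linguistic BERT stack rather than a single attention layer; the sketch only shows the cross-modal attention pattern.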