SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements such as walls, doors, and windows, as well as oriented object bounding boxes with their semantic categories. Unlike previous methods, which rely on task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study of various modeling and training decisions. On public benchmarks, our model achieves state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we demonstrate a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
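The "structured 3D scene understanding outputs" described above can be pictured as plain text that the LLM emits and a downstream consumer parses into walls and oriented object boxes. The following is a minimal illustrative sketch under that assumption; the line format, field layout, and class names here are hypothetical stand-ins, not the paper's actual output schema:

```python
from dataclasses import dataclass

@dataclass
class Wall:
    # Endpoints of the wall on the floor plan, plus wall height (meters).
    x1: float
    y1: float
    x2: float
    y2: float
    height: float

@dataclass
class ObjectBox:
    # Oriented 3D box: semantic label, center, size, and yaw rotation.
    label: str
    cx: float
    cy: float
    cz: float
    sx: float
    sy: float
    sz: float
    yaw: float

def parse_scene(text: str) -> tuple[list[Wall], list[ObjectBox]]:
    """Parse a toy line-based scene script into structured elements.

    The format is invented for illustration: one element per line,
    comma-separated, with the element kind in the first field.
    """
    walls: list[Wall] = []
    objects: list[ObjectBox] = []
    for line in text.strip().splitlines():
        kind, *fields = [f.strip() for f in line.split(",")]
        if kind == "wall":
            walls.append(Wall(*map(float, fields)))
        elif kind == "object":
            label, *nums = fields
            objects.append(ObjectBox(label, *map(float, nums)))
    return walls, objects

# A toy "LLM output": two walls and one sofa box.
scene_text = """
wall,0.0,0.0,4.0,0.0,2.8
wall,4.0,0.0,4.0,3.0,2.8
object,sofa,2.0,1.5,0.4,1.8,0.9,0.8,0.0
"""
walls, objects = parse_scene(scene_text)
```

The appeal of emitting structured text rather than raw regression targets is that it keeps the model within the standard multimodal LLM interface: the scene representation is just another token sequence.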