Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy

With the rapid advancement of artificial intelligence and robotics, the integration of Large Language Models (LLMs) with 3D vision is emerging as a transformative approach to enhancing robotic sensing technologies. This convergence enables machines to perceive, reason and interact with complex environments through natural language and spatial understanding, bridging the gap between linguistic intelligence and spatial perception. This review provides a comprehensive analysis of state-of-the-art methodologies, applications and challenges at the intersection of LLMs and 3D vision, with a focus on next-generation robotic sensing technologies. We first introduce the foundational principles of LLMs and 3D data representations, followed by an in-depth examination of 3D sensing technologies critical for robotics. The review then explores key advancements in scene understanding, text-to-3D generation, object grounding and embodied agents, highlighting cutting-edge techniques such as zero-shot 3D segmentation, dynamic scene synthesis and language-guided manipulation. Furthermore, we discuss multimodal LLMs that integrate 3D data with touch, auditory and thermal inputs, enhancing environmental comprehension and robotic decision-making. To support future research, we catalog benchmark datasets and evaluation metrics tailored for 3D-language and vision tasks. Finally, we identify key challenges and future research directions, including adaptive model architectures, enhanced cross-modal alignment and real-time processing capabilities, which pave the way for more intelligent, context-aware and autonomous robotic sensing systems.
