This paper presents AlphaSpace, a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This approach enables LLMs to accurately manipulate objects by positioning them at specific (x, y, z) coordinates. Experimental results suggest that AlphaSpace demonstrates promising potential for improving manipulation tasks, achieving a total accuracy of 66.67%, compared to 37.5% for GPT-4o and 29.17% for Claude 3.5 Sonnet. These results demonstrate the potential of structured spatial encoding for manipulation tasks and warrant further exploration.
翻译:暂无翻译