Robots operating in unstructured environments require a comprehensive understanding of their surroundings, necessitating geometric and semantic information from sensor data. Traditional RGB-D processing pipelines focus primarily on geometric reconstruction, limiting their ability to support advanced robotic perception, planning, and interaction. A key challenge is the lack of generalized methods for segmenting RGB-D data into semantically meaningful components while maintaining accurate geometric representations. We introduce a novel end-to-end modular pipeline that integrates state-of-the-art semantic segmentation, human tracking, point-cloud fusion, and scene reconstruction. Our approach improves semantic segmentation accuracy by leveraging the foundation segmentation model SAM2 with a hybrid method that combines its mask generation with a semantic classification model, resulting in sharper masks and high classification accuracy. Compared to SegFormer and OneFormer, our method achieves similar semantic segmentation accuracy (mIoU of 47.0% vs. 45.9% on the ADE20K dataset) but provides much more precise object boundaries. Additionally, our human tracking algorithm interacts with the segmentation module, enabling continuous tracking through object re-identification even when objects leave and re-enter the frame. By leveraging the semantic information, our point cloud fusion approach reduces computation time by a factor of 1.81 while maintaining a small mean reconstruction error of 25.3 mm. We validate our approach on benchmark datasets and real-world Kinect RGB-D data, demonstrating improved efficiency, accuracy, and usability. Our structured representation, stored in the Universal Scene Description (USD) format, supports efficient querying, visualization, and robotic simulation, making it practical for real-world deployment.
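To make the hybrid segmentation step concrete, the sketch below outlines one plausible reading of the abstract, not the paper's actual implementation: `mask_generator` stands in for SAM2's class-agnostic mask generation and `semantic_model` for a dense semantic classifier (both assumed interfaces), and each sharp mask is assigned the majority class of the pixels it covers.

```python
import numpy as np

# Hypothetical interfaces -- the abstract does not specify exact APIs.
# mask_generator.generate(image) is assumed to return class-agnostic boolean masks
# (standing in for SAM2's automatic mask generation), and semantic_model(image) is
# assumed to return a dense (H, W) map of integer class ids (e.g., a SegFormer-style
# classifier).

def hybrid_semantic_masks(image, mask_generator, semantic_model, min_pixels=50):
    """Assign a semantic class to each sharp, class-agnostic mask by majority vote
    over a dense classifier's labels (a sketch of one possible hybrid scheme)."""
    label_map = semantic_model(image)          # (H, W) integer class ids
    masks = mask_generator.generate(image)     # list of (H, W) boolean arrays
    labeled = []
    for mask in masks:
        if mask.sum() < min_pixels:
            continue                           # skip tiny, noisy masks
        votes = label_map[mask]                # classifier labels inside the mask
        class_id = np.bincount(votes).argmax() # majority class within the mask
        labeled.append((mask, int(class_id)))
    return labeled                             # sharp boundaries + semantic labels
```

The intended benefit, as described in the abstract, is that the mask generator supplies precise object boundaries while the classifier supplies the semantic labels.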
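As a final illustration, the structured scene representation could be serialized to USD roughly as sketched below. This is a minimal, assumed workflow using the standard `pxr` Python bindings; the prim path, attribute name, and file name are placeholders rather than the paper's actual schema.

```python
from pxr import Usd, UsdGeom, Sdf

# Minimal sketch (assumed schema): store one fused, semantically labeled point cloud
# as a UsdGeom.Points prim so it can be queried, visualized, or loaded in a simulator.
stage = Usd.Stage.CreateNew("scene.usda")
cloud = UsdGeom.Points.Define(stage, "/World/objects/chair_0")   # placeholder prim path
cloud.CreatePointsAttr([(0.0, 0.0, 0.0), (0.01, 0.0, 0.0)])      # fused 3D points (metres)

# Attach the semantic label as a custom string attribute (illustrative naming only).
label_attr = cloud.GetPrim().CreateAttribute("semantic:label", Sdf.ValueTypeNames.String)
label_attr.Set("chair")

stage.GetRootLayer().Save()
```

Storing per-object prims with semantic attributes in this way is one route to the querying, visualization, and simulation use cases the abstract mentions.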