VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation

In imitation learning, visuomotor diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions, which have significant potential. In this paper, we propose a Vision-Only, single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to fuse semantic and geometric features effectively. We use intermediate features from VGGT, which combine semantic features from DINOv2 with geometric features from Alternating Attention blocks. The features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only significantly outperforms the vision-only baseline DP but also shows a distinct performance trend relative to the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%), while in real-world tasks it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions, including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, the library supports multi-machine, multi-GPU parallel training as well as mixed-precision training. It is compatible with visuomotor policies such as DP, DP3, and VO-DP, and also supports the RoboTwin simulator.
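As a rough illustration of the fusion step described above (cross-attention between semantic and geometric tokens, followed by CNN-based spatial compression), the following PyTorch sketch may help. It is not the authors' implementation: the class name `SemanticGeometricFusion`, the feature dimensions, the choice of geometric tokens as queries and semantic tokens as keys/values, the residual connection, and the pooling layer are all assumptions made for illustration, and random tensors stand in for the VGGT/DINOv2 features.

```python
import torch
import torch.nn as nn


class SemanticGeometricFusion(nn.Module):
    """Hypothetical fusion block: cross-attention, then CNN compression."""

    def __init__(self, dim=64, heads=4, out_dim=32):
        super().__init__()
        # Cross-attention over patch tokens (batch-first layout: B, N, dim).
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Small CNN that downsamples the token grid and pools it to a vector.
        self.compress = nn.Sequential(
            nn.Conv2d(dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, semantic, geometric, grid_hw):
        # semantic, geometric: (B, N, dim) with N == H * W patch tokens.
        h, w = grid_hw
        # Assumption: geometric tokens query the semantic tokens.
        fused, _ = self.attn(query=geometric, key=semantic, value=semantic)
        fused = self.norm(geometric + fused)  # residual connection + norm
        b, n, c = fused.shape
        # Restore the 2D patch grid so the CNN can compress it spatially.
        fmap = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.compress(fmap).flatten(1)  # (B, out_dim)


torch.manual_seed(0)
fusion = SemanticGeometricFusion()
sem = torch.randn(2, 16, 64)  # stand-in for DINOv2 semantic tokens
geo = torch.randn(2, 16, 64)  # stand-in for Alternating Attention tokens
cond = fusion(sem, geo, (4, 4))
print(cond.shape)  # torch.Size([2, 32])
```

The resulting vector `cond` plays the role of the compact observation embedding that would condition the diffusion policy head; in a real system the token grid and channel widths would follow the chosen backbone rather than the toy sizes used here.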
