GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, Wanli Peng, Jingchao Qiao, Zeyu Ren, Haixin Shi, Zhi Su, Jiawen Tian, Yuyang Xiao, Shenyu Zhang, Liwei Zheng, Hang Li, Yonghui Wu

We present GR-RL, a robotic learning framework that turns a generalist vision-language-action (VLA) policy into a highly capable specialist for long-horizon dexterous manipulation. Existing VLA policies rest on the assumption that human demonstrations are optimal. However, we claim that in highly dexterous and precise manipulation tasks, human demonstrations are noisy and suboptimal. GR-RL proposes a multi-stage training pipeline that filters, augments, and reinforces the demonstrations by reinforcement learning. First, GR-RL learns a vision-language-conditioned task progress function, filters the demonstration trajectories, and keeps only the transitions that contribute positively to the progress. Specifically, we show that by directly applying offline RL with sparse reward, the resulting $Q$-values can be treated as a robust progress function. Next, we introduce morphological symmetry augmentation that greatly improves the generalization and performance of GR-RL. Lastly, to better align the VLA policy with its deployment behaviors for high-precision control, we perform online RL by learning a latent space noise predictor. With this pipeline, GR-RL is, to our knowledge, the first learning-based policy that can autonomously lace up a shoe by threading shoelaces through multiple eyelets with an 83.3% success rate, a task requiring long-horizon reasoning, millimeter-level precision, and compliant soft-body interaction. We hope GR-RL provides a step toward enabling generalist robot foundation models to specialize into reliable real-world experts.
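
To make the filtering step concrete, the following is a minimal sketch of keeping only the transitions that raise a progress estimate. It is not the authors' implementation: the abstract does not state the exact filtering criterion, so the one-step gain test, the `min_gain` threshold, and the function name `filter_by_progress` are assumptions made for illustration. The per-step progress values stand in for the $Q$-values that the abstract says can be treated as a progress function.

```python
import numpy as np


def filter_by_progress(progress, min_gain=0.0):
    """Return the indices t of transitions (t -> t+1) whose progress gain exceeds min_gain.

    `progress` holds one scalar progress estimate per timestep of a single
    demonstration (hypothetical stand-in for learned Q-values).
    """
    progress = np.asarray(progress, dtype=float)
    gains = np.diff(progress)  # gains[t] = progress[t+1] - progress[t]
    return np.flatnonzero(gains > min_gain)


# Toy trajectory: step 1 -> 2 regresses and step 2 -> 3 stalls.
progress = [0.10, 0.20, 0.15, 0.15, 0.40, 0.90]
kept = filter_by_progress(progress)
print(kept.tolist())
```

In this toy trajectory the regressing and stalled transitions are dropped, and only the transitions starting at steps 0, 3, and 4 would be passed on to policy training. A stricter `min_gain` would discard slow but positive progress as well.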
