Reinforcement learning (RL) is widely used to produce robust robotic manipulation policies, but fine-tuning vision-language-action (VLA) models with RL can be unstable due to inaccurate value estimates and sparse supervision at intermediate steps. In contrast, imitation learning (IL) is easy to train but often underperforms due to its offline nature. In this paper, we propose Hi-ORS, a simple yet effective post-training method that uses rejection sampling to achieve both training stability and high robustness. Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning, and adopts a reward-weighted supervised training objective to provide dense intermediate-step supervision. For systematic study, we develop an asynchronous inference-training framework that supports flexible online human-in-the-loop corrections, which serve as explicit guidance for learning error-recovery behaviors. Across three real-world tasks and two embodiments, Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training, outperforming RL and IL baselines by a substantial margin in both effectiveness and efficiency. Notably, the fine-tuned policy exhibits strong test-time scalability, reliably executing complex error-recovery behaviors that further improve performance.
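To make the core idea concrete, the sketch below illustrates a single Hi-ORS-style update step: rejection sampling discards negatively rewarded samples, and the remaining samples are trained with a reward-weighted supervised objective. This is a minimal illustration under assumed PyTorch-style conventions; the function names, batch layout, and `policy.log_prob` interface are hypothetical and not taken from the paper's implementation.

```python
import torch


def hi_ors_update(policy, batch, optimizer):
    """One illustrative Hi-ORS-style update (assumptions, not the paper's code):
    reject negatively rewarded samples, then apply a reward-weighted
    supervised (behavior-cloning) loss on the retained samples."""
    obs, actions, rewards = batch["obs"], batch["actions"], batch["rewards"]

    # Rejection sampling: keep only non-negatively rewarded samples,
    # which avoids noisy negative supervision and stabilizes training.
    keep = rewards >= 0
    if keep.sum() == 0:
        return None  # nothing usable in this batch
    obs, actions, rewards = obs[keep], actions[keep], rewards[keep]

    # Reward-weighted supervised objective: dense per-step supervision,
    # weighted by each retained sample's reward.
    log_probs = policy.log_prob(obs, actions)  # assumed policy interface
    loss = -(rewards * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, the filtering step plays the role of implicit value estimation (only positively rewarded behavior is imitated), while the reward weighting supplies the dense intermediate-step signal described in the abstract.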