Generating human-like behavior on robots is a significant challenge, especially in dexterous manipulation tasks with robotic hands. Even in simulation with no sample constraints, scripting controllers is intractable due to the high degrees of freedom, and manual reward engineering is also difficult and can lead to unrealistic motions. Leveraging recent progress in Reinforcement Learning from Human Feedback (RLHF), we propose a framework that learns a universal human prior from direct human preference feedback over videos, for efficiently fine-tuning RL policies on 20 dual-hand robot manipulation tasks in simulation, without a single human demonstration. A single task-agnostic reward model is trained by iteratively generating diverse policies and collecting human preferences over their trajectories; it is then applied to regularize policy behavior during the fine-tuning stage. Our method empirically produces more human-like behaviors on robot hands across diverse tasks, including unseen tasks, indicating its generalization capability.
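To make the preference-learning step concrete, the following is a minimal sketch of the objective such a framework typically optimizes; the notation (trajectory segments $\tau^1, \tau^2$, learned human-prior reward $r_\psi$, task reward $r_{\text{task}}$, preference label $y$, and weight $\lambda$) is assumed here for illustration rather than taken from the paper's own formulation. Under the standard Bradley-Terry model, the probability that a human prefers $\tau^1$ over $\tau^2$ and the corresponding training loss are

\[
P_\psi\!\left[\tau^1 \succ \tau^2\right] \;=\;
\frac{\exp\!\big(\sum_t r_\psi(s^1_t, a^1_t)\big)}
     {\exp\!\big(\sum_t r_\psi(s^1_t, a^1_t)\big) + \exp\!\big(\sum_t r_\psi(s^2_t, a^2_t)\big)},
\]
\[
\mathcal{L}(\psi) \;=\;
-\,\mathbb{E}_{(\tau^1,\tau^2,y)\sim\mathcal{D}}
\Big[\, y \log P_\psi\!\left[\tau^1 \succ \tau^2\right]
      + (1-y)\log P_\psi\!\left[\tau^2 \succ \tau^1\right] \Big].
\]

During fine-tuning, the policy would then be trained on a combined reward of the form $r = r_{\text{task}} + \lambda\, r_\psi$, so the learned human prior regularizes behavior without replacing the task objective.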