An Incremental Inverse Reinforcement Learning Approach for Motion Planning with Human Preferences

Humans often behave differently because of their personal preferences, for instance regarding their individual execution style or personal safety margin. In this paper, we consider the problem of integrating such preferences into trajectory planning for robotic manipulators. We first learn reward functions that represent the user's path and motion preferences from kinesthetic demonstrations. We then use a discrete-time trajectory optimization scheme to produce trajectories that adhere to both task requirements and user preferences. We go beyond the state of the art by designing a feature set that captures the fundamental preferences in a manipulation task, such as the timing of the motion. We further demonstrate that our method is capable of generalizing such preferences to new scenarios. We implement our algorithm on a Franka Emika 7-DoF robot arm, and validate the functionality and flexibility of our approach in a user study. The results show that non-expert users are able to teach the robot their preferences with just a few iterations of feedback.

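The idea of learning a reward from user feedback and then scoring trajectories with it can be illustrated with a toy example. The sketch below is not the paper's algorithm: it assumes a cost that is linear in three hand-picked features (path length, duration, and negative clearance to a single point obstacle) and a simple feature-difference weight update. The function names, the 2-D waypoints, the obstacle position, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np


def features(traj, dt, obstacle):
    """Map a trajectory of shape (T, 2) to a feature vector.

    Features (all assumptions of this sketch):
      path length, total duration, negative minimum clearance to the obstacle.
    """
    steps = np.diff(traj, axis=0)
    path_length = np.linalg.norm(steps, axis=1).sum()
    duration = dt * (len(traj) - 1)
    clearance = np.linalg.norm(traj - obstacle, axis=1).min()
    return np.array([path_length, duration, -clearance])


def cost(weights, traj, dt, obstacle):
    """Linear cost: lower cost means the trajectory is preferred."""
    return float(weights @ features(traj, dt, obstacle))


def update_weights(weights, proposed, corrected, dt, obstacle, lr=1.0):
    """One incremental update from a single piece of user feedback.

    Moves the weights so that the user's corrected trajectory becomes
    cheaper relative to the robot's proposed one, then clips at zero.
    """
    grad = features(proposed, dt, obstacle) - features(corrected, dt, obstacle)
    return np.maximum(weights + lr * grad, 0.0)


# Hypothetical scenario: the robot proposes a straight line through the
# obstacle, and the user demonstrates a detour that keeps some distance.
obstacle = np.array([0.5, 0.0])
proposed = np.array([[0.0, 0.0], [0.25, 0.0], [0.5, 0.0], [0.75, 0.0], [1.0, 0.0]])
corrected = np.array([[0.0, 0.0], [0.25, 0.3], [0.5, 0.4], [0.75, 0.3], [1.0, 0.0]])

weights = np.array([1.0, 1.0, 0.1])
for _ in range(3):  # a few rounds of the same feedback
    weights = update_weights(weights, proposed, corrected, dt=0.1, obstacle=obstacle)
```

With the initial weights the straight line is cheaper; after a few updates the clearance weight grows and the detour becomes the lower-cost trajectory. A real system would plug the learned weights into a trajectory optimizer rather than compare two fixed candidates.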