Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior work on reward learning has mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that cannot train new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. Through experiments in tabular and continuous control environments, we demonstrate that the severity of relearning failures can be sensitive to changes in reward model design and trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature.