Preference-based Reinforcement Learning has shown much promise in leveraging binary human feedback on queried trajectory pairs to recover the underlying reward model of the Human in the Loop (HiL). While prior works have attempted to better utilize the queries made to the human, in this work we make two observations about the unlabeled trajectories collected by the agent and propose two corresponding loss functions: one ensures that unlabeled trajectories participate in the reward learning process, and the other structures the embedding space of the reward model so that it reflects the structure of the state space with respect to action distances. We validate the proposed method on one locomotion domain and one robotic manipulation task, comparing against the state-of-the-art baseline PEBBLE. We further present an ablation of the proposed loss components across both domains and find that not only does each loss component individually outperform the baseline, but their synergistic combination yields substantially better reward recovery and human-feedback sample efficiency.
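For context, below is a minimal sketch of the standard preference-based reward-learning objective (a Bradley-Terry cross-entropy over queried segment pairs, as used by the PEBBLE baseline). The model class `RewardMLP`, the segment shapes, and the training loop are illustrative assumptions, not the paper's implementation; the paper's two auxiliary losses on unlabeled trajectories would enter as additional terms where indicated in the comments.

```python
# Sketch of standard PbRL reward learning with a Bradley-Terry preference loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardMLP(nn.Module):
    """Per-step reward r(s, a) predicted from a concatenated (state, action) vector."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, segments):               # segments: (batch, T, input_dim)
        return self.net(segments).squeeze(-1)  # per-step rewards: (batch, T)


def preference_loss(reward_model, seg_a, seg_b, labels):
    """Bradley-Terry cross-entropy on human-labeled trajectory-segment pairs."""
    ret_a = reward_model(seg_a).sum(dim=1)      # summed predicted return of segment a
    ret_b = reward_model(seg_b).sum(dim=1)      # summed predicted return of segment b
    logits = torch.stack([ret_a, ret_b], dim=1)
    return F.cross_entropy(logits, labels)      # labels: 0 = "a preferred", 1 = "b preferred"


if __name__ == "__main__":
    batch, T, dim = 8, 50, 20
    model = RewardMLP(dim)
    seg_a, seg_b = torch.randn(batch, T, dim), torch.randn(batch, T, dim)
    labels = torch.randint(0, 2, (batch,))
    loss = preference_loss(model, seg_a, seg_b, labels)
    # The proposed method would add its two losses over unlabeled trajectories
    # (participation in reward learning, and embedding-space structuring) here
    # before back-propagating through the reward model.
    loss.backward()
```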