Learning Multimodal Rewards from Rankings

Learning from human feedback has been shown to be a useful approach for acquiring robot reward functions. However, expert feedback is often assumed to be drawn from an underlying unimodal reward function. This assumption does not always hold, for example in settings where multiple experts provide data or where a single expert provides data for different tasks. We therefore go beyond learning a unimodal reward and focus on learning a multimodal reward function. We formulate multimodal reward learning as a mixture learning problem and develop a novel ranking-based learning approach in which the experts are only required to rank a given set of trajectories. Furthermore, because interaction data is often expensive to collect in robotics, we develop an active querying approach to accelerate the learning process. We conduct experiments and user studies using a multi-task variant of OpenAI's LunarLander and a real Fetch robot, where we collect data from multiple users with different preferences. The results suggest that our approach can efficiently learn multimodal reward functions and improves data efficiency over benchmark methods that we adapt to our learning problem.
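
To make the "mixture learning from rankings" idea concrete, the following is a minimal sketch of how the likelihood of one expert ranking could be scored under a mixture of reward functions. It is not the authors' implementation: the Plackett-Luce ranking model, the linear reward over trajectory features, and every name below (`plackett_luce_log_prob`, `mixture_ranking_log_likelihood`, `features`, `weights`, `mixing`) are assumptions chosen for illustration.

```python
import numpy as np


def plackett_luce_log_prob(rewards_in_ranked_order):
    """Log-probability of a ranking (best item first) under a Plackett-Luce model.

    Assumption: items are picked one at a time, each with probability
    proportional to exp(reward) among the items not yet picked.
    """
    r = np.asarray(rewards_in_ranked_order, dtype=float)
    log_p = 0.0
    for i in range(len(r) - 1):  # the last remaining item is chosen with probability 1
        tail = r[i:]
        m = tail.max()
        log_norm = m + np.log(np.exp(tail - m).sum())  # stable log-sum-exp
        log_p += r[i] - log_norm
    return log_p


def mixture_ranking_log_likelihood(features, ranking, weights, mixing):
    """Log-likelihood of one ranking under a mixture of linear rewards.

    features: (K, d) array, one feature vector per trajectory (hypothetical).
    ranking:  sequence of K trajectory indices, best first.
    weights:  (M, d) array, one linear reward per mode (assumed reward form).
    mixing:   (M,) array of positive mixing coefficients that sum to 1.
    """
    features = np.asarray(features, dtype=float)
    ranked = features[list(ranking)]
    log_terms = np.array([
        np.log(alpha) + plackett_luce_log_prob(ranked @ np.asarray(w, dtype=float))
        for w, alpha in zip(weights, mixing)
    ])
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).sum())
```

Under these assumptions, fitting the model would mean maximizing the sum of this log-likelihood over all collected rankings with respect to `weights` and `mixing`. An active querying scheme would additionally choose which set of trajectories to show next; that step is not sketched here.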
