In this paper, we study the well-known team orienteering problem, in which a fleet of robots collects rewards by visiting locations. The rewards are usually assumed to be known to the robots; however, in applications such as environmental monitoring or scene reconstruction, the rewards are often subjective and difficult to specify. We propose a framework that learns the user's unknown preferences by presenting alternative solutions, which the user then ranks. We consider two cases for the user: 1) a deterministic user who provides the optimal ranking of the alternative solutions, and 2) a noisy user who provides the optimal ranking according to an unknown probability distribution. For the deterministic user, we propose a framework that minimizes a bound on the regret, that is, the maximum deviation from the optimal solution. We then adapt the approach to the noisy user and minimize the expected regret. Finally, we demonstrate the importance of learning user preferences and the performance of the proposed methods in an extensive set of experiments using real-world datasets for environmental monitoring problems.