In complex tasks where the reward function is not straightforward and consists of a set of objectives, multiple reinforcement learning (RL) policies that perform the task adequately, but employ different strategies, can be trained by adjusting the impact of individual objectives on the reward function. Understanding the differences in strategies between policies is necessary to enable users to choose between the offered policies, and can help developers understand the different behaviors that emerge from various reward functions and training hyperparameters in RL systems. In this work we compare the behavior of two policies trained on the same task, but with different preferences over objectives. We propose a method for distinguishing between differences in behavior that stem from different abilities and those that are a consequence of the opposing preferences of the two RL agents. Furthermore, we use only the data on preference-based differences to generate contrasting explanations about the agents' preferences. Finally, we test and evaluate our approach on an autonomous driving task, comparing the behavior of a safety-oriented policy with that of one that prefers speed.