We study the problem of multiple agents learning concurrently in a multi-objective environment. Specifically, we consider two agents that repeatedly play a multi-objective normal-form game. In such games, the payoffs resulting from joint actions are vector-valued. Taking a utility-based approach, we assume that each agent has a utility function that maps payoff vectors to scalar utilities, and we consider agents that aim to maximise the utility of their expected payoff vector. As agents do not necessarily know their opponent's utility function or strategy, they must learn policies through repeated interaction with each other. To aid agents in arriving at adequate solutions, we introduce four novel preference communication protocols, covering both cooperative and self-interested communication. Each protocol specifies how one agent communicates preferences over its actions and how the other agent responds. These protocols are evaluated on a set of five benchmark games against baseline agents that do not communicate. We find that preference communication can drastically alter the learning process and lead to the emergence of cyclic Nash equilibria which had not been previously observed in this setting. Additionally, we introduce a communication scheme in which agents must learn when to communicate. In games with Nash equilibria, we find that communication can be beneficial but is difficult to learn when agents have different preferred equilibria. When agents prefer the same equilibrium, they become indifferent to communication. In games without Nash equilibria, our results differ across learning rates. With faster learners, explicit communication becomes more prevalent, occurring around 50% of the time, as it helps the agents learn a compromise joint policy. Slower learners retain this pattern to a lesser degree, but show increased indifference.
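To make the optimisation criterion concrete, the following minimal sketch computes the utility of an expected payoff vector for one agent under a pair of mixed strategies. The 2x2 payoff table, the product utility function, and all function names are hypothetical and chosen only for illustration; they are not taken from the benchmark games or the agents evaluated in this work.

```python
from typing import Callable, Sequence, Tuple

Vector = Tuple[float, ...]


def expected_payoff(
    payoffs: Sequence[Sequence[Vector]],
    row_strategy: Sequence[float],
    col_strategy: Sequence[float],
) -> Vector:
    """Expected payoff vector for one agent under two mixed strategies."""
    n_objectives = len(payoffs[0][0])
    expected = [0.0] * n_objectives
    for i, p in enumerate(row_strategy):
        for j, q in enumerate(col_strategy):
            for k in range(n_objectives):
                # Weight each joint action's payoff by its probability.
                expected[k] += p * q * payoffs[i][j][k]
    return tuple(expected)


def utility_of_expectation(
    payoffs: Sequence[Sequence[Vector]],
    row_strategy: Sequence[float],
    col_strategy: Sequence[float],
    utility: Callable[[Vector], float],
) -> float:
    """Apply the utility function AFTER taking the expectation."""
    return utility(expected_payoff(payoffs, row_strategy, col_strategy))


# Hypothetical two-objective payoff table for one agent (assumption).
PAYOFFS = [
    [(4.0, 0.0), (2.0, 2.0)],
    [(2.0, 2.0), (0.0, 4.0)],
]


def product_utility(v: Vector) -> float:
    """Hypothetical nonlinear utility: rewards balanced objectives."""
    return v[0] * v[1]


uniform = (0.5, 0.5)
pure_first = (1.0, 0.0)
mixed_value = utility_of_expectation(PAYOFFS, uniform, uniform, product_utility)
pure_value = utility_of_expectation(PAYOFFS, pure_first, pure_first, product_utility)
```

Under these assumptions, both agents mixing uniformly gives the expected vector (2, 2), while both playing their first action gives (4, 0). A nonlinear utility such as the product therefore ranks the mixed joint strategy above the pure one, which illustrates why the opponent's utility function matters for the policy an agent should learn.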