We study the problem of designing AI agents that can learn to cooperate effectively with a potentially suboptimal partner while having no access to the joint reward function. We model this problem as a cooperative episodic two-agent Markov decision process. We assume control over only the first of the two agents in a Stackelberg formulation of the game, where the second agent acts so as to maximise expected utility given the first agent's policy. How should the first agent act in order to learn the joint reward function as quickly as possible and to make the joint policy as close to optimal as possible? In this paper, we analyse how knowledge about the reward function can be gained in this interactive two-agent scenario. We show that when the learning agent's policies have a significant effect on the transition function, the reward function can be learned efficiently.
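To make the Stackelberg interaction concrete, here is a minimal toy sketch (not the paper's construction; the dynamics, rewards, and dimensions are random placeholders): the leader commits to a policy pi1, and the follower best-responds by backward induction on the single-agent MDP induced by pi1, which is exactly the "second agent maximises expected utility given the first agent's policy" setup described above.

```python
# A minimal sketch of a Stackelberg best response in a finite-horizon
# cooperative two-agent MDP. All quantities here are illustrative
# assumptions, not the paper's model: S states, A1/A2 action sets,
# horizon H, random joint transitions P and joint reward R.
import numpy as np

rng = np.random.default_rng(0)
S, A1, A2, H = 4, 2, 3, 5

# P[s, a1, a2] is a distribution over next states; R[s, a1, a2] is the
# joint reward (known to the follower, unknown to the learning leader).
P = rng.dirichlet(np.ones(S), size=(S, A1, A2))
R = rng.random((S, A1, A2))

# Leader's committed (stochastic, time-dependent) policy: pi1[t, s, a1].
pi1 = rng.dirichlet(np.ones(A1), size=(H, S))

def best_response(pi1):
    """Follower's optimal deterministic policy given the leader's pi1,
    computed by backward induction on the induced MDP."""
    V = np.zeros(S)                  # terminal value at horizon H
    pi2 = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        # Q2[s, a2] = E_{a1 ~ pi1[t, s]}[ R(s, a1, a2) + E_{s'} V(s') ]
        Q2 = np.einsum('sa,sab->sb', pi1[t], R + P @ V)
        pi2[t] = Q2.argmax(axis=1)
        V = Q2.max(axis=1)
    return pi2, V                    # V[s]: joint value from s at t = 0

pi2, V0 = best_response(pi1)
print("follower best response at t=0:", pi2[0], "values:", V0.round(3))
```

Note the design point this sketch makes visible: because the follower's best response depends on the leader's committed pi1, the leader can shape the state-action distribution the pair visits, which is the lever the abstract refers to when the leader's policies have a significant effect on the transition function.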