We consider the problem of episodic reinforcement learning with multiple stakeholders, each with a different reward function. Our goal is to output a policy that is socially fair with respect to these reward functions. Prior works have proposed different objectives that a fair policy must optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem and propose four axioms that any such fair objective must satisfy. We show that Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives fail to do so. We then consider the learning version of the problem, where the underlying model, i.e., the Markov decision process, is unknown. We study the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to each of the three objectives. For the objective of Nash social welfare, we also derive a lower bound on regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve the regret by a factor of $O(H)$ for a weaker notion of regret.