We study fair multi-objective reinforcement learning, in which an agent must learn a policy that simultaneously achieves high reward on multiple dimensions of a vector-valued reward. Motivated by the fair resource allocation literature, we model this as an expected welfare maximization problem for some non-linear fair welfare function of the vector of long-term cumulative rewards. One canonical example of such a function is the Nash Social Welfare, or geometric mean, whose log transform is also known as the Proportional Fairness objective. We show that approximately optimizing the expected Nash Social Welfare is computationally intractable, even in the tabular case. Nevertheless, we provide a novel adaptation of Q-learning that combines non-linear scalarized learning updates with non-stationary action selection to learn effective policies for optimizing nonlinear welfare functions. We show that our algorithm is provably convergent, and we demonstrate experimentally that our approach outperforms techniques based on linear scalarization, mixtures of optimal linear scalarizations, or stationary action selection for the Nash Social Welfare objective.
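For concreteness, a standard formulation of the two objectives named above is sketched below, assuming $d$ reward dimensions and a vector of nonnegative long-term cumulative rewards $\mathbf{V} = (V_1, \dots, V_d)$; this notation is introduced here for illustration and is not taken from the abstract itself.
\[
\mathrm{NSW}(\mathbf{V}) \;=\; \Big(\prod_{i=1}^{d} V_i\Big)^{1/d},
\qquad
\log \mathrm{NSW}(\mathbf{V}) \;=\; \frac{1}{d}\sum_{i=1}^{d} \log V_i ,
\]
so that maximizing the Nash Social Welfare is equivalent to maximizing the Proportional Fairness objective, i.e., the (scaled) sum of logarithms of the per-dimension cumulative rewards.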