Reinforcement Learning (RL) algorithms such as DQN owe their success to Markov Decision Processes, where maximizing the sum of rewards permits backward induction and a reduction to the Bellman optimality equation. However, many real-world problems require optimizing an objective that is non-linear in the cumulative rewards, for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We note that when a non-linear function of the sum of rewards is considered, the problem loses its Markov nature. This paper formalizes and addresses the problem of optimizing a non-linear function of the long-term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\Tilde{O}\left(KDSA\sqrt{\frac{A}{T}}\right)$ for $K$ users. Further, using fairness in cellular base-station scheduling and in queueing-system scheduling as examples, the proposed algorithm is shown to significantly outperform conventional RL approaches.
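For concreteness, one illustrative instance of such an objective (an assumption made here for exposition; the symbols $f$, $\bar{r}_k^{\pi}$, $r_{k,t}$ and the proportional-fairness choice of $f$ are not taken from the abstract itself) is to maximize a concave function of the per-user long-term average rewards,
$$\max_{\pi}\; f\big(\bar{r}_1^{\pi},\ldots,\bar{r}_K^{\pi}\big), \qquad \bar{r}_k^{\pi} = \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} r_{k,t}\right], \qquad f(x_1,\ldots,x_K)=\sum_{k=1}^{K}\log x_k,$$
where the logarithmic choice of $f$ corresponds to proportional fairness. Because $f$ is not additive across time steps, the objective does not decompose via backward induction, which is why the standard Bellman optimality equation no longer applies.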