Finding optimal policies that maximize the long-term rewards of a Markov Decision Process requires dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimizing an objective that is non-linear in the cumulative rewards, for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We note that when an agent aims to optimize some function of the sum of rewards, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long-term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based algorithm is shown to achieve a regret of $\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave, $L$-Lipschitz function. Further, using fairness in cellular base-station scheduling and in queueing-system scheduling as examples, the proposed algorithms are shown to significantly outperform conventional RL approaches.
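To make the objective concrete, the display below sketches one plausible formalization consistent with the abstract; the scalarizing function $f$ and the per-objective rewards $r_t^{(k)}$ are illustrative placeholders rather than notation taken from the paper. The agent seeks a policy $\pi$ maximizing a concave, $L$-Lipschitz function of the $K$ long-term average rewards,
\[
\max_{\pi} \; f\!\left(\frac{1}{T}\sum_{t=1}^{T} r_t^{(1)}, \;\dots,\; \frac{1}{T}\sum_{t=1}^{T} r_t^{(K)}\right),
\]
where a proportional-fairness choice such as $f(x_1,\dots,x_K)=\sum_{k=1}^{K}\log x_k$ would correspond to the base-station scheduling example. Because $f$ is applied to the averages rather than added per step, the value of an action depends on the rewards accumulated so far, which is why the problem is no longer Markov in the original state.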