Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as finance-related problems, many engineering problems treat future rewards equally and thus prefer a long-run average criterion. In this paper, we study the reinforcement learning problem under the long-run average criterion. First, we develop a unified trust region theory covering both the discounted and average criteria. Under the average criterion, a novel performance bound within the trust region is derived using Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first to study the trust region approach with the average criterion, and it complements the reinforcement learning framework beyond the discounted criterion. Finally, experiments are conducted on the continuous control benchmark MuJoCo. In most tasks, APO outperforms the discounted PPO, which demonstrates the effectiveness of our approach.
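For context, a minimal sketch of the two criteria contrasted above, using standard textbook definitions (the symbols $J_\gamma$, $\rho$, and $\pi$ are our notation, not taken from the paper): the discounted criterion is
$$ J_\gamma(\pi) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad 0 \le \gamma < 1, $$
while the long-run average criterion weights all future rewards equally,
$$ \rho(\pi) = \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_\pi\!\left[\sum_{t=0}^{T-1} r_t\right]. $$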