Markov Decision Process (MDP) problems can be solved using Dynamic Programming (DP) methods, which suffer from the curse of dimensionality and the curse of modeling. To overcome these issues, Reinforcement Learning (RL) methods are adopted in practice. In this paper, we aim to obtain the optimal admission control policy in a system serving multiple classes of customers. Using DP techniques, we prove that it is optimal to admit customers of the $i$th class only up to a threshold $\tau(i)$, which is a non-increasing function of $i$. In contrast to traditional RL algorithms, which do not take the structural properties of the optimal policy into account while learning, we propose a structure-aware learning algorithm that exploits the threshold structure of the optimal policy. We prove the asymptotic convergence of the proposed algorithm to the optimal policy. Owing to the reduction in the policy space, the structure-aware learning algorithm provides significant improvements in storage and computational complexity over classical RL algorithms. Simulation results also establish the gain in the convergence rate of the proposed algorithm over other RL algorithms. The techniques presented in this paper can be applied to general MDP problems arising in various applications such as inventory management, financial planning, and communication networks.
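As a minimal sketch of the stated threshold structure (assuming, for illustration only, that the state $s$ denotes the current system occupancy; the precise state definition is not given in this abstract), the optimal policy can be written as
\[
\pi^*(s, i) =
\begin{cases}
\text{admit} & \text{if } s < \tau(i),\\
\text{reject} & \text{otherwise},
\end{cases}
\qquad \tau(1) \ge \tau(2) \ge \cdots
\]
so that learning reduces to estimating the thresholds $\tau(i)$ rather than an action for every state-class pair.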