We consider a new form of reinforcement learning (RL) based on opportunities to directly learn the optimal control policy, together with a general Markov decision process (MDP) framework devised to support these opportunities. We present derivations of general classes of our control-based RL methods, along with forms of exploration and exploitation for learning and applying the optimal control policy over time. Our general MDP framework extends the classical Bellman operator and optimality criteria by generalizing the definition and scope of a policy for any given state. Through this general MDP framework, we establish the convergence and optimality of our control-based methods, both in general and within various control paradigms (e.g., piecewise linear control policies), including the convergence of $Q$-learning within the context of our MDP framework. Our empirical results demonstrate and quantify the significant benefits of our approach.
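For context, the classical objects that the abstract says our framework extends can be stated in their standard forms; this is only a reference sketch for a finite discounted MDP with reward $r$, transition kernel $P$, discount factor $\gamma$, and step sizes $\alpha_t$ (notational assumptions, not the paper's own generalized operator). The Bellman optimality operator and the tabular $Q$-learning update are
\[
  (\mathcal{T}Q)(s,a) \;=\; r(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a'),
\]
\[
  Q_{t+1}(s_t,a_t) \;=\; Q_t(s_t,a_t) \;+\; \alpha_t \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1},a') - Q_t(s_t,a_t) \Big),
\]
with all other entries of $Q_{t+1}$ unchanged; the paper's framework generalizes the notion of a policy at each state beyond the single-action choice implicit in these classical forms.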