Q-learning is a popular reinforcement learning algorithm, but it has been studied and analysed mainly in the infinite horizon setting. However, several important applications are naturally modeled as finite horizon Markov decision processes (MDPs). We develop a version of the Q-learning algorithm for finite horizon MDPs and provide a complete proof of its stability and convergence. Our analysis of the stability and convergence of finite horizon Q-learning is based entirely on the ordinary differential equation (ODE) method. We also demonstrate the performance of our algorithm on randomly generated MDPs as well as on an application to smart grids.
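For concreteness, the following is a minimal sketch of tabular finite horizon Q-learning, in which a separate Q-table is maintained for each time step (since the optimal policy of a finite horizon MDP is time-dependent) and the terminal table is fixed at zero. The `env` interface (`reset`, `step`) and all hyperparameters here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def finite_horizon_q_learning(env, H, n_states, n_actions,
                              episodes=5000, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning for a finite horizon MDP with horizon H.

    A separate Q-table Q[h] is kept for each step h = 0..H-1; the
    terminal table Q[H] is identically zero. `env` is assumed to
    expose reset() -> s and step(s, a) -> (s_next, r).
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((H + 1, n_states, n_actions))  # Q[H] stays zero
    for _ in range(episodes):
        s = env.reset()
        for h in range(H):
            # epsilon-greedy exploration at step h
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[h, s]))
            s_next, r = env.step(s, a)
            # backup bootstraps from the *next-step* table Q[h+1]
            target = r + np.max(Q[h + 1, s_next])
            Q[h, s, a] += alpha * (target - Q[h, s, a])
            s = s_next
    return Q[:H]
```

The key design choice, relative to standard infinite horizon Q-learning, is indexing the Q-function by the time step, so each update at step h bootstraps from the table at step h+1 rather than from the same table.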