We study efficient algorithms for reinforcement learning in Markov decision processes, where the complexity of the algorithm is independent of the number of states. This formulation succinctly captures large-scale problems, but is also known to be computationally hard in its general form. Previous approaches attempt to circumvent the computational hardness by assuming structure in either the transition function or the value function, or by relaxing the solution guarantee to a local optimality condition. We consider the methodology of boosting, borrowed from supervised learning, for converting weak learners into an accurate policy. The notion of weak learning we study is that of sample-based approximate optimization of linear functions over policies. Under this assumption of weak learnability, we give an efficient algorithm that is capable of improving the accuracy of such weak learning methods until global optimality is reached. We prove sample complexity and running time bounds on our method that are polynomial in the natural parameters of the problem: approximation guarantee, discount factor, distribution mismatch, and number of actions. In particular, our bound does not depend on the number of states. A technical difficulty in applying previous boosting results is that the value function over policy space is not convex. We show how to use a non-convex variant of the Frank-Wolfe method, coupled with recent advances in gradient boosting that allow incorporating a weak learner with a multiplicative approximation guarantee, to overcome the non-convexity and attain global convergence.
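To make the role of the weak learner concrete, the following is a minimal sketch of a Frank-Wolfe loop in which each step calls an oracle that only approximately maximizes a linear function, using noisy sample-based estimates. It is an illustration under strong simplifying assumptions and not the algorithm analyzed here: the objective is a toy concave function over the probability simplex (so the non-convexity addressed above does not arise), the oracle's error is additive noise rather than a multiplicative guarantee, and the names `weak_linear_oracle`, `frank_wolfe`, and `objective` are hypothetical.

```python
import random


def weak_linear_oracle(gradient, num_samples, rng, noise_std=0.05):
    """Hypothetical weak learner: return the simplex vertex that maximizes a
    noisy, sample-averaged estimate of the linear score <gradient, e_i>."""
    estimates = []
    for g in gradient:
        samples = [g + rng.gauss(0.0, noise_std) for _ in range(num_samples)]
        estimates.append(sum(samples) / num_samples)
    best = max(range(len(gradient)), key=lambda i: estimates[i])
    return [1.0 if i == best else 0.0 for i in range(len(gradient))]


def objective(x, target):
    """Toy concave objective f(x) = -0.5 * ||x - target||^2 (an assumption)."""
    return -0.5 * sum((a - b) ** 2 for a, b in zip(x, target))


def frank_wolfe(target, num_rounds=200, num_samples=25, seed=0):
    """Maximize the toy objective over the simplex using only oracle calls."""
    rng = random.Random(seed)
    n = len(target)
    x = [1.0 / n] * n  # start from the uniform distribution
    for t in range(num_rounds):
        # Gradient of the toy objective at the current iterate.
        gradient = [target[i] - x[i] for i in range(n)]
        # The weak learner approximately solves the linear subproblem.
        v = weak_linear_oracle(gradient, num_samples, rng)
        # Standard Frank-Wolfe step size; the iterate stays in the simplex
        # because it is a convex combination of points in the simplex.
        step = 2.0 / (t + 2)
        x = [(1.0 - step) * x[i] + step * v[i] for i in range(n)]
    return x


if __name__ == "__main__":
    target = [0.7, 0.2, 0.1]
    x = frank_wolfe(target)
    print("final iterate:", [round(v, 3) for v in x])
    print("objective:", round(objective(x, target), 5))
```

The loop never projects or inspects the feasible set directly; it only combines oracle outputs, which is the property that lets a boosting-style method build an accurate solution from many weak ones.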