Reducing reinforcement learning to supervised learning is a well-studied and effective approach that leverages the benefits of compact function approximation to deal with large-scale Markov decision processes. Independently, the boosting methodology (e.g. AdaBoost) has proven to be indispensable in designing efficient and accurate classification algorithms by combining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to a sequence of weak learning problems. Since weak learners perform only marginally better than random guesses, such subroutines constitute a weaker assumption than the availability of an accurate supervised learning oracle. We prove that the sample complexity and running time bounds of the proposed method do not explicitly depend on the number of states. While existing results on boosting operate on convex losses, the value function over policies is non-convex. We show how to use a non-convex variant of the Frank-Wolfe method for boosting, which additionally improves upon the known sample complexity and running time even for reductions to supervised learning.
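To make the Frank-Wolfe-style boosting idea concrete, the sketch below shows a generic loop that maintains a policy as a convex combination of weak policies, with the linear-minimization oracle replaced by a weak-learner call. This is only an illustrative toy under assumed names (`weak_learner`, `value_gradient`, the tabular policy representation, and the `2/(t+2)` step size); it is not the paper's actual algorithm or analysis.

```python
import numpy as np

# Illustrative sketch (not the paper's method): Frank-Wolfe-style boosting where
# the policy is a per-state distribution over actions, updated as a convex
# combination of weak policies.

n_states, n_actions, n_rounds = 10, 4, 20
rng = np.random.default_rng(0)

def weak_learner(grad):
    # Hypothetical weak learner: returns a deterministic policy that only needs
    # to correlate slightly with the descent direction; here it greedily picks,
    # in each state, the action the negated gradient favors most.
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), np.argmax(-grad, axis=1)] = 1.0
    return pi

def value_gradient(policy):
    # Placeholder for an estimate of the gradient of the (non-convex) value
    # function with respect to the policy's action probabilities.
    return rng.standard_normal((n_states, n_actions))

# Start from the uniform policy and blend in one weak policy per round.
policy = np.full((n_states, n_actions), 1.0 / n_actions)
for t in range(n_rounds):
    grad = value_gradient(policy)
    direction = weak_learner(grad)                 # stands in for the linear-minimization oracle
    eta = 2.0 / (t + 2)                            # standard Frank-Wolfe step size
    policy = (1 - eta) * policy + eta * direction  # each row remains a valid distribution

print(policy.sum(axis=1))  # sanity check: every state's action probabilities sum to 1
```

Because each update is a convex combination, the ensemble policy stays in the simplex per state without any projection step, which is the structural property Frank-Wolfe-type methods exploit.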