We make progress on a long-standing problem in batch reinforcement learning (RL): learning $Q^\star$ from an exploratory, polynomial-sized dataset, using a realizable and otherwise arbitrary function class. All existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learning is impossible in this setting (Chen and Jiang, 2019). Our algorithm, BVFT, breaks the hardness conjecture (albeit under a stronger notion of exploratory data) via a tournament procedure that reduces the learning problem to pairwise comparisons, and solves the latter with the help of a state-action partition constructed from the compared functions. We also discuss how BVFT can be applied to model selection, among other extensions and open problems.
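The tournament structure described above can be illustrated with a minimal sketch: each candidate value function is scored by its worst pairwise comparison, and the candidate with the smallest worst-case score wins. The function name `bvft_tournament` and the `pairwise_loss` callback are illustrative assumptions, not the paper's actual interface; in BVFT the comparison loss would be computed on the state-action partition induced by the two functions being compared.

```python
def bvft_tournament(candidates, pairwise_loss):
    """Select the candidate whose worst pairwise comparison loss is smallest.

    `candidates` is a list of candidate Q-functions (any hashable stand-in
    here); `pairwise_loss(f, g)` is assumed to return a nonnegative score
    measuring how badly f fails the comparison against g. Both names are
    hypothetical placeholders for illustration.
    """
    worst = {}
    for f in candidates:
        # Score f by its worst loss against every other candidate.
        worst[f] = max(
            (pairwise_loss(f, g) for g in candidates if g is not f),
            default=0.0,
        )
    # The tournament winner minimizes the worst-case pairwise loss.
    return min(worst, key=worst.get)
```

For example, with three stand-in candidates and a table of pairwise losses, the winner is the one whose largest loss across its two comparisons is smallest.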