Deep reinforcement learning has achieved impressive successes yet often requires very large amounts of interaction data. This result is perhaps unsurprising, as using complicated function approximation often requires more data to fit, and early theoretical results on linear Markov decision processes provide regret bounds that scale with the dimension of the linear approximation. Ideally, we would like to automatically identify the minimal dimension of the approximation that is sufficient to encode an optimal policy. Towards this end, we consider the problem of model selection in RL with function approximation, given a set of candidate RL algorithms with known regret guarantees. The learner's goal is to adapt to the complexity of the optimal algorithm without knowing it \textit{a priori}. We present a meta-algorithm that successively rejects increasingly complex models using a simple statistical test. Given at least one candidate that satisfies realizability, we prove the meta-algorithm adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret compared to the optimal candidate's $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of episodes and $L$ is the number of algorithms. The dimension and horizon dependencies remain optimal with respect to the best candidate, and our meta-algorithmic approach is flexible enough to incorporate multiple candidate algorithms and models. Finally, we show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds that depend on the gaps between the maximal values attainable by the candidates.
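To make the selection mechanism concrete, the following is a minimal Python sketch of a successive-rejection loop of the kind described above: candidates are run from least to most complex, and the current one is discarded when its empirical return falls detectably short of its advertised guarantee. The class names, the specific confidence bound, and the rejection threshold are illustrative assumptions, not the paper's exact test or constants.

\begin{verbatim}
import math
import random


class CandidateAlgorithm:
    """Stand-in for a base RL algorithm that comes with a known
    O(c * sqrt(n)) regret guarantee under realizability."""

    def __init__(self, name, putative_value, regret_coefficient, attainable_value):
        self.name = name
        self.putative_value = putative_value          # value claimed under realizability
        self.regret_coefficient = regret_coefficient  # constant c in its regret bound
        self.attainable_value = attainable_value      # value it actually reaches

    def run_episode(self):
        # Placeholder interaction: noisy episodic return around what the
        # candidate can actually attain (lower than putative if misspecified).
        return self.attainable_value + random.gauss(0.0, 0.5)


def select_model(candidates, total_episodes, delta=0.05):
    """Successively reject candidates, from least to most complex, whose
    empirical returns fall detectably short of their advertised guarantees."""
    index, rewards, schedule = 0, [], []
    for t in range(1, total_episodes + 1):
        algo = candidates[index]
        rewards.append(algo.run_episode())
        schedule.append((t, algo.name))

        n = len(rewards)
        empirical_mean = sum(rewards) / n
        # What the candidate should average after paying its own regret bound.
        advertised = algo.putative_value - algo.regret_coefficient / math.sqrt(n)
        # Hoeffding-style confidence slack for the statistical test (an assumption here).
        slack = math.sqrt(2.0 * math.log(2.0 * n * n / delta) / n)

        if empirical_mean < advertised - slack and index + 1 < len(candidates):
            index += 1      # reject: move to the next, more complex candidate
            rewards = []    # restart statistics for the newly selected candidate
    return schedule


if __name__ == "__main__":
    random.seed(0)
    candidates = [
        CandidateAlgorithm("small-model", putative_value=1.0,
                           regret_coefficient=2.0, attainable_value=0.4),  # misspecified
        CandidateAlgorithm("large-model", putative_value=1.0,
                           regret_coefficient=5.0, attainable_value=1.0),  # realizable
    ]
    schedule = select_model(candidates, total_episodes=2000)
    print("final candidate:", schedule[-1][1])
\end{verbatim}

In this toy run, the misspecified small model is rejected once enough episodes accumulate for its shortfall to exceed the confidence slack, after which the realizable candidate is never rejected; the paper's guarantees concern how little regret is paid during this adaptation phase.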