This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximations. We propose to study convergence to approximate local maxima, because we show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward. For both nonlinear bandits and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOL), which provably converges to a local maximum with sample complexity that depends only on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds for several concrete settings, such as linear bandits with a finite or sparse model class and two-layer neural net bandits. A key algorithmic insight is that optimism may lead to over-exploration even for a two-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return, provided the model can also reasonably predict the size of the gradient and Hessian of the real return.
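As a rough illustration of the target notion (the symbols f, θ, and ε below are not from the abstract, and the exact tolerance scaling is just one common convention), a point θ is an approximate local maximum of the expected return f when its gradient is small and its curvature admits no large ascent direction:
\[
\|\nabla f(\theta)\| \le \epsilon
\quad\text{and}\quad
\lambda_{\max}\!\left(\nabla^2 f(\theta)\right) \le \sqrt{\epsilon}.
\]
Under this reading, the last sentence of the abstract says that ascending on the model's virtual return can certify such a point for the real return, as long as the model's predictions of the gradient and Hessian magnitudes are reasonably accurate.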