This paper studies model-based bandit and reinforcement learning (RL) with nonlinear function approximation. We propose to study convergence to approximate local maxima, because we show that global convergence is statistically intractable even for a one-layer neural net bandit with a deterministic reward. For both nonlinear bandits and RL, the paper presents a model-based algorithm, Virtual Ascent with Online Model Learner (ViOlin), which provably converges to a local maximum with sample complexity that depends only on the sequential Rademacher complexity of the model class. Our results imply novel global or local regret bounds for several concrete settings, such as linear bandits with a finite or sparse model class and two-layer neural net bandits. A key algorithmic insight is that optimism may lead to over-exploration even for the two-layer neural net model class. On the other hand, for convergence to local maxima, it suffices to maximize the virtual return, provided the model can also reasonably predict the size of the gradient and Hessian of the real return.
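To make the "virtual ascent with an online model learner" idea concrete, the following is a minimal, illustrative sketch for a nonlinear bandit: the learner fits a model online from observed (action, reward) pairs and ascends the gradient of the model's predicted (virtual) reward. It assumes a hypothetical quadratic-feature model class rather than the two-layer networks analyzed in the paper, and names such as true_reward and OnlineModel are placeholders, not the paper's implementation.

```python
# Minimal sketch of a virtual-ascent loop for a nonlinear bandit.
# Assumptions: toy quadratic-feature model class (not the paper's two-layer
# nets), noisy scalar rewards, simple online least-squares model updates.
import numpy as np

rng = np.random.default_rng(0)
d = 5                                         # action dimension

def true_reward(a):
    # Unknown, possibly nonconcave reward; the learner sees noisy evaluations.
    return np.tanh(a @ np.ones(d)) - 0.1 * (a @ a)

class OnlineModel:
    """Toy model class: reward_hat(a) = w . phi(a) with phi(a) = (a, a^2)."""
    def __init__(self, d, lr=0.05):
        self.w = np.zeros(2 * d)
        self.lr = lr

    def features(self, a):
        return np.concatenate([a, a ** 2])

    def grad_wrt_action(self, a):
        # d/da [ w . (a, a^2) ] = w[:d] + 2 * w[d:] * a
        return self.w[:d] + 2.0 * self.w[d:] * a

    def update(self, a, r):
        # One online least-squares step on the newly observed (action, reward).
        err = self.w @ self.features(a) - r
        self.w -= self.lr * err * self.features(a)

model = OnlineModel(d)
a = rng.normal(size=d)                        # initial action
eta = 0.1                                     # virtual-ascent step size
for t in range(200):
    r = true_reward(a) + 0.01 * rng.normal()  # play action, observe noisy reward
    model.update(a, r)                        # online model learning
    a = a + eta * model.grad_wrt_action(a)    # ascend the *virtual* (model) reward
print("final action:", np.round(a, 3), "reward:", round(true_reward(a), 3))
```

The point of the sketch is only the control flow: no optimism bonus is added; exploration comes from following the gradient of the learned model, which suffices for local convergence when the model also tracks the gradient and Hessian of the real return.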