The classical theory of reinforcement learning (RL) has focused on tabular and linear representations of value functions. Further progress hinges on combining RL with modern function approximators such as kernel functions and deep neural networks, and indeed there have been many empirical successes that have exploited such combinations in large-scale applications. There are profound challenges, however, in developing a theory to support this enterprise, most notably the need to take into consideration the exploration-exploitation tradeoff at the core of RL in conjunction with the computational and statistical tradeoffs that arise in modern function-approximation-based learning systems. We approach these challenges by studying an optimistic modification of the least-squares value iteration algorithm, where the action-value function is represented by a kernel function or an overparameterized neural network. We establish both polynomial runtime complexity and polynomial sample complexity for this algorithm, without additional assumptions on the data-generating model. In particular, we prove that the algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states, a result that clearly exhibits the benefit of function approximation in RL.
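To make the optimistic least-squares backup concrete, the following is a minimal Python sketch of a single step: fit the backed-up targets by kernel ridge regression and add a UCB-style exploration bonus before truncating to $[0, H]$. The RBF kernel, the regularization weight `lam`, and the bonus multiplier `beta` are illustrative assumptions, not the paper's exact construction or constants.

```python
# A minimal sketch (assumed setup, not the paper's exact algorithm) of one
# backup step of optimistic least-squares value iteration with a kernel.
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def optimistic_q_estimate(Z, y, z_query, lam=1.0, beta=1.0, H=10):
    """
    Kernel ridge regression of backed-up targets y on past state-action
    features Z, plus a UCB-style bonus (illustrative choice of bonus).

    Z:       (n, d) past state-action pairs at this step of the episode.
    y:       (n,)   regression targets r + max_a Q_{h+1}(s', a).
    z_query: (m, d) state-action pairs at which to evaluate Q.
    Returns optimistic Q-values clipped to [0, H].
    """
    n = Z.shape[0]
    K = rbf_kernel(Z, Z)                    # Gram matrix of past data
    k_q = rbf_kernel(Z, z_query)            # cross-kernel to query points
    A = K + lam * np.eye(n)
    alpha = np.linalg.solve(A, y)           # ridge-regression weights
    mean = k_q.T @ alpha                    # regression estimate of Q
    # Bonus proportional to the kernel "posterior" standard deviation.
    k_qq = np.ones(z_query.shape[0])        # k(z, z) = 1 for the RBF kernel
    var = k_qq - np.einsum('ij,ji->i', k_q.T, np.linalg.solve(A, k_q))
    bonus = beta * np.sqrt(np.maximum(var, 0.0))
    return np.clip(mean + bonus, 0.0, H)    # optimism, truncated to [0, H]

# Toy usage: 20 past transitions with 3-dimensional state-action features.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))
y = rng.uniform(0.0, 1.0, size=20)
print(optimistic_q_estimate(Z, y, rng.normal(size=(5, 3))))
```

The bonus term mirrors the role of optimism in the regret analysis: state-action pairs that are poorly covered by past data receive larger bonuses, which drives exploration without any assumption on the data-generating model.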