Many real-world reinforcement learning tasks require control of complex dynamical systems that involve both costly data acquisition processes and large state spaces. In cases where the transition dynamics can be readily evaluated at specified states (e.g., via a simulator), agents can operate in what is often referred to as planning with a \emph{generative model}. We propose the AE-LSVI algorithm for best-policy identification, a novel variant of the kernelized least-squares value iteration (LSVI) algorithm that combines optimism with pessimism for active exploration (AE). AE-LSVI provably identifies a near-optimal policy \emph{uniformly} over an entire state space and achieves polynomial sample complexity guarantees that are independent of the number of states. When specialized to the recently introduced offline contextual Bayesian optimization setting, our algorithm achieves improved sample complexity bounds. Experimentally, we demonstrate that AE-LSVI outperforms other RL algorithms in a variety of environments when robustness to the initial state is required.
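To make the optimism/pessimism idea mentioned above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of an active-exploration query rule: select the state-action pair where the gap between an optimistic (upper) and a pessimistic (lower) Q-estimate is largest. The names `q_upper`, `q_lower`, and the toy candidate grid are assumptions introduced purely for illustration; in AE-LSVI the bounds would come from kernelized least-squares value-iteration backups.

```python
import numpy as np

def select_query(states, actions, q_upper, q_lower):
    """Return the (state, action) pair maximizing the confidence gap.

    q_upper, q_lower: callables mapping (state, action) -> float, e.g.
    upper/lower confidence bounds on the Q-function (hypothetical here).
    """
    best, best_gap = None, -np.inf
    for s in states:
        for a in actions:
            gap = q_upper(s, a) - q_lower(s, a)
            if gap > best_gap:
                best, best_gap = (s, a), gap
    return best, best_gap

# Toy usage on a discretized problem with placeholder bounds.
if __name__ == "__main__":
    states = np.linspace(-1.0, 1.0, 11)
    actions = np.array([-1.0, 0.0, 1.0])
    q_up = lambda s, a: -(s - a) ** 2 + 0.5            # placeholder optimistic estimate
    q_lo = lambda s, a: -(s - a) ** 2 - 0.5 * abs(s)   # placeholder pessimistic estimate
    print(select_query(states, actions, q_up, q_lo))
```

Querying where this gap is largest drives data acquisition toward the regions of the state space where the value function is least certain, which is what allows uniform (initial-state-independent) guarantees.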