We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the \emph{reward-free} exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even when the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest. Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity.