We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogeneous linear Markov decision processes (linear MDPs), whose transition dynamics can be parameterized as a linear function of a given feature mapping, we propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $K$ is the number of episodes. Our algorithm is based on a weighted linear regression scheme with carefully designed weights, which depend on a new variance estimator that (1) directly estimates the variance of the \emph{optimal} value function, (2) monotonically decreases with the number of episodes to ensure better estimation accuracy, and (3) uses a rare-switching policy to update the value function estimator, controlling the complexity of the estimated value function class. Our work provides a complete answer to the problem of optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
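To make the weighted regression step concrete, a rough sketch is given below; the notation ($\phi_{h,k}$, $\bar\sigma_{h,k}$, $\lambda$) is introduced here only for illustration and is not taken verbatim from the algorithm. At stage $h$, the value function parameter can be estimated by variance-weighted ridge regression,
$$
\widehat{\theta}_h \;=\; \Big(\lambda I + \sum_{k} \bar\sigma_{h,k}^{-2}\,\phi_{h,k}\phi_{h,k}^\top\Big)^{-1} \sum_{k} \bar\sigma_{h,k}^{-2}\,\phi_{h,k}\,\widehat V_{h+1}\big(s_{h+1}^{k}\big),
$$
where $\phi_{h,k}$ is the feature vector of the $k$-th observed state-action pair at stage $h$, $\widehat V_{h+1}$ is the current estimate of the optimal value function at the next stage, $\bar\sigma_{h,k}^2$ is a (suitably floored) estimate of its variance, and $\lambda > 0$ is a regularization parameter. The inverse-variance weights down-weight high-variance samples, which is the mechanism behind the sharper regret bound.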