We consider an online learning problem where the learner interacts with a Markov decision process over a sequence of episodes, in which the reward function is allowed to change between episodes in an adversarial manner and the learner only gets to observe the rewards associated with its own actions. We allow the state space to be arbitrarily large, but we assume that all action-value functions can be represented as linear functions in terms of a known low-dimensional feature map, and that the learner has access to a simulator of the environment that allows generating trajectories from the true MDP dynamics. Our main contribution is a computationally efficient algorithm that we call MDP-LinExp3, whose regret we prove to be bounded by $\widetilde{\mathcal{O}}\big(H^2 T^{2/3} (dK)^{1/3}\big)$, where $T$ is the number of episodes, $H$ is the number of steps in each episode, $K$ is the number of actions, and $d$ is the dimension of the feature map. We also show that the regret can be improved to $\widetilde{\mathcal{O}}\big(H^2 \sqrt{TdK}\big)$ under much stronger assumptions on the MDP dynamics. To our knowledge, MDP-LinExp3 is the first provably efficient algorithm for this problem setting.
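For concreteness, the regret bounds above can be read against the standard notion of regret in episodic MDPs with adversarial rewards; the following is our assumed reading of that quantity, not a definition taken from the abstract itself:
\[
\mathfrak{R}_T \;=\; \max_{\pi}\ \mathbb{E}\!\left[\sum_{t=1}^{T}\sum_{h=1}^{H}\Bigl(r_t\bigl(x_h^{\pi},a_h^{\pi}\bigr)-r_t\bigl(x_{t,h},a_{t,h}\bigr)\Bigr)\right],
\]
where $r_t$ denotes the reward function chosen by the adversary in episode $t$, $(x_h^{\pi},a_h^{\pi})_{h=1}^{H}$ is the trajectory generated by a fixed comparator policy $\pi$ under the true MDP dynamics, and $(x_{t,h},a_{t,h})_{h=1}^{H}$ is the learner's trajectory in episode $t$.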