We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and the agent interacts with the MDP for $H$ time steps within that episode. Our goal is to learn a policy that nearly maximizes the $H$ time-step cumulative reward in such a model. Previous work established an upper bound for RMMDPs only in the case $M=2$. In this work, we resolve several open questions that remained for the RMMDP model. For arbitrary $M \ge 2$, we provide a sample-efficient algorithm, $\texttt{EM}^2$, that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d\right)$ episodes, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d=\min(2M-1,H)$. Our technique is a higher-order extension of the method-of-moments approach; nevertheless, the design and analysis of the $\texttt{EM}^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ for a general instance of RMMDP, supporting that sample complexity super-polynomial in $M$ is necessary.
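As a concrete illustration of the interaction protocol described above (not of the $\texttt{EM}^2$ algorithm itself), the following Python sketch simulates an episodic RMMDP with a tabular parameterization. All names, sizes, and the Bernoulli reward model are illustrative assumptions; the key point is that the latent reward model is drawn once per episode and never revealed to the agent.

```python
# Minimal generative sketch of an episodic reward-mixing MDP (RMMDP).
# Illustrative parameterization only; not the EM^2 algorithm.
import numpy as np

rng = np.random.default_rng(0)

S, A, H, M = 5, 3, 10, 2                    # states, actions, horizon, latent reward models
mix = np.full(M, 1.0 / M)                   # mixing weights over the M latent reward models
P = rng.dirichlet(np.ones(S), size=(S, A))  # shared transition kernel: P[s, a] is a next-state distribution
R = rng.random((M, S, A))                   # Bernoulli reward means, one table per latent reward model

def run_episode(policy):
    """Roll out one episode: nature draws a latent reward model once,
    then the agent interacts for H steps without observing it."""
    m = rng.choice(M, p=mix)                # latent context, fixed for the whole episode
    s, total = 0, 0.0
    for h in range(H):
        a = policy(s, h)
        total += rng.random() < R[m, s, a]  # reward sampled from the hidden reward model
        s = rng.choice(S, p=P[s, a])        # transitions do not depend on the latent context
    return total

uniform_policy = lambda s, h: rng.integers(A)
print(np.mean([run_episode(uniform_policy) for _ in range(1000)]))
```

The sketch highlights the source of difficulty: only the reward distribution depends on the hidden context, so the learner must aggregate information across many episodes rather than infer the context within a single one.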