Learning a near-optimal policy in a partially observable system remains an elusive challenge in contemporary reinforcement learning. In this work, we consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP). There, a reward function is drawn from one of multiple possible reward models at the beginning of every episode, but the identity of the chosen reward model is not revealed to the agent. Hence, the latent state space, for which the dynamics are Markovian, is not given to the agent. We study the problem of learning a near-optimal policy for a reward-mixing MDP with two reward models. Unlike existing approaches that rely on strong assumptions on the dynamics, we make no assumptions and study the problem in full generality. Indeed, with no further assumptions, even for two switching reward models, the problem requires several new ideas beyond existing algorithmic and analysis techniques for efficient exploration. We provide the first polynomial-time algorithm that finds an $\epsilon$-optimal policy after exploring $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the number of states and actions, respectively. This is the first efficient algorithm that does not require any assumptions in partially observed environments where the observation space is smaller than the latent state space.
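To make the interaction protocol concrete, the following is a minimal, illustrative sketch (not taken from the paper) of a tabular reward-mixing MDP simulator: at the start of each episode a reward model is sampled from a fixed mixture and kept hidden, and the agent observes only states and realized rewards. All class and variable names here are hypothetical.

```python
import numpy as np

class RewardMixingMDP:
    """Hypothetical simulator for the episodic reward-mixing MDP setting."""

    def __init__(self, S, A, H, transition, reward_models, mix_weights, rng=None):
        self.S, self.A, self.H = S, A, H      # number of states, actions, horizon
        self.P = transition                   # shape (S, A, S), row-stochastic
        self.R = reward_models                # list of (S, A) reward tables
        self.w = mix_weights                  # mixture weights over reward models
        self.rng = rng or np.random.default_rng()

    def reset(self):
        # The latent reward model is drawn once per episode and never revealed.
        self._m = self.rng.choice(len(self.R), p=self.w)
        self._t = 0
        self._s = 0                           # fixed initial state for simplicity
        return self._s

    def step(self, a):
        r = self.R[self._m][self._s, a]       # reward comes from the hidden model
        s_next = self.rng.choice(self.S, p=self.P[self._s, a])
        self._s, self._t = s_next, self._t + 1
        done = self._t >= self.H
        return s_next, r, done                # the model index self._m is not returned
```

With two reward models (the $M=2$ case studied in the paper), `reward_models` would contain two tables and `mix_weights` their mixing probabilities; the agent's difficulty is that trajectories from different episodes are generated under different, unobserved reward functions.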