Reinforcement learning is generally difficult for partially observable Markov decision processes (POMDPs), which arise when the agent's observations are partial or noisy. To achieve good performance in POMDPs, one strategy is to endow the agent with a finite memory whose update is governed by the policy. However, policy optimization is non-convex in that case and can lead to poor training performance from random initialization. Performance can be improved empirically by constraining the memory architecture, thereby sacrificing optimality to facilitate training. Here we study this trade-off in a two-hypothesis testing problem, akin to the two-armed bandit problem. We compare two extreme cases: (i) a random access memory, where any transition between $M$ memory states is allowed, and (ii) a fixed memory, where the agent can access its last $m$ actions and rewards. For (i), the probability $q$ of playing the worse arm is known to be exponentially small in $M$ for the optimal policy. Our main result is that similar performance can be reached for (ii) as well, despite the simplicity of the memory architecture: using a conjecture on Gray-ordered binary necklaces, we find policies for which $q$ is exponentially small in $2^m$, i.e. $q\sim\alpha^{2^m}$ with $\alpha < 1$. In addition, we observe empirically that training from random initialization leads to very poor results for (i) and significantly better results for (ii), thanks to the constraints on the memory architecture.
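To make the fixed-memory setting (ii) concrete, the following minimal sketch considers the smallest case $m = 1$, where the agent sees only its last action and reward. It uses the classic win-stay, lose-shift rule purely as an illustration; this is not the policy constructed in the paper. The Bernoulli reward model, the arm probabilities `p_good` and `p_bad`, and both function names are assumptions introduced here.

```python
import random


def wsls_stationary_q(p_good, p_bad):
    """Long-run probability q of playing the worse arm under win-stay, lose-shift.

    The played arm follows a two-state Markov chain: the agent leaves an arm
    exactly when that arm yields no reward. (Illustrative assumption: rewards
    are Bernoulli with success probabilities p_good > p_bad.)
    """
    leave_good = 1.0 - p_good  # a loss on the better arm triggers a switch
    leave_bad = 1.0 - p_bad    # a loss on the worse arm triggers a switch back
    return leave_good / (leave_good + leave_bad)


def simulate_wsls(p_good, p_bad, steps=100_000, seed=0):
    """Estimate q by simulation; the memory is just (last action, last reward)."""
    rng = random.Random(seed)
    probs = (p_good, p_bad)    # arm 0 is the better arm, arm 1 the worse one
    action = rng.randrange(2)  # arbitrary first action
    worse_plays = 0
    for _ in range(steps):
        worse_plays += action  # counts plays of arm 1
        reward = rng.random() < probs[action]
        if not reward:         # lose-shift; on a win the action is kept
            action = 1 - action
    return worse_plays / steps


if __name__ == "__main__":
    print("closed form:", wsls_stationary_q(0.9, 0.1))
    print("simulation: ", simulate_wsls(0.9, 0.1))
```

With this one-step memory, $q$ stays bounded away from zero for fixed arm probabilities, which is the baseline against which the $q\sim\alpha^{2^m}$ scaling for longer windows should be read.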