Reinforcement learning becomes much more complex when the agent's observation is partial or noisy. This setting corresponds to a partially observable Markov decision process (POMDP). One strategy to achieve good performance in POMDPs is to endow the agent with a finite memory whose update is governed by the policy. However, policy optimization is then non-convex and can lead to poor training performance under random initialization. Performance can be improved empirically by constraining the memory architecture, thereby sacrificing optimality to facilitate training. Here we study this trade-off in the two-arm bandit problem and compare two extreme cases: (i) a random access memory, where any transitions between $M$ memory states are allowed, and (ii) a fixed memory, where the agent can access its last $m$ actions and rewards. For (i), the probability $q$ of playing the worst arm is known to be exponentially small in $M$ for the optimal policy. Our main result is to show that similar performance can be reached for (ii) as well, despite the simplicity of the memory architecture: using a conjecture on Gray-ordered binary necklaces, we find policies for which $q$ is exponentially small in $2^m$, i.e. $q \sim \alpha^{2^m}$ for some $\alpha < 1$. Interestingly, we observe empirically that training from random initialization leads to very poor results for (i) and significantly better results for (ii).
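To make the setting of case (ii) concrete, the following is a minimal simulation sketch in Python, not the paper's Gray-necklace construction: a Bernoulli two-arm bandit in which the agent's only state is a sliding window of its last $m$ (action, reward) pairs. The window rule used here (play the arm with the higher empirical mean inside the window, explore on ties) is a hypothetical stand-in for the optimized fixed-memory policies; the reported fraction of pulls on the worst arm is a rough empirical analogue of $q$.

\begin{verbatim}
# Sketch only: two-arm Bernoulli bandit with a fixed memory of the last m
# (action, reward) pairs, i.e. case (ii) of the abstract. The decision rule
# is a simple illustrative heuristic, not the Gray-necklace policy.
import random
from collections import deque

def run_episode(p=(0.4, 0.6), m=4, horizon=10_000, seed=0):
    rng = random.Random(seed)
    window = deque(maxlen=m)          # fixed memory: last m (action, reward) pairs
    worst_arm = min(range(2), key=lambda a: p[a])
    worst_plays = 0
    for _ in range(horizon):
        # Empirical mean reward of each arm within the window (None if unseen).
        means = []
        for a in range(2):
            rewards = [r for (act, r) in window if act == a]
            means.append(sum(rewards) / len(rewards) if rewards else None)
        if means[0] is None or means[1] is None or means[0] == means[1]:
            action = rng.randrange(2)  # explore when the window is uninformative
        else:
            action = 0 if means[0] > means[1] else 1
        reward = 1 if rng.random() < p[action] else 0
        window.append((action, reward))
        worst_plays += (action == worst_arm)
    return worst_plays / horizon      # empirical frequency of the worst arm

if __name__ == "__main__":
    for m in (2, 4, 8):
        print(f"m={m}: fraction of pulls on the worst arm ~ {run_episode(m=m):.3f}")
\end{verbatim}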