Recurrent meta reinforcement learning (meta-RL) agents are agents that employ a recurrent neural network (RNN) for the purpose of "learning a learning algorithm". After training on a pre-specified task distribution, the trained weights of the agent's RNN are said to give rise to activity dynamics that implement an efficient learning algorithm, which allows the agent to quickly solve new tasks sampled from the same distribution. However, due to the black-box nature of these agents, the way in which they work is not yet fully understood. In this study, we shed light on the internal working mechanisms of these agents by reformulating the meta-RL problem using the Partially Observable Markov Decision Process (POMDP) framework. We hypothesize that the learned activity dynamics act as belief states for such agents. Several illustrative experiments support this hypothesis and suggest that recurrent meta-RL agents can be viewed as agents that learn to act optimally in partially observable environments consisting of multiple related tasks. This view helps explain their failure cases and some interesting model-based results reported in the literature.
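To make the hypothesis concrete: in a POMDP, the belief state is the posterior over the hidden state given the interaction history, updated recursively after each action and observation. A minimal sketch of this update in standard POMDP notation (the symbols below are generic and not taken from this abstract) is

\[
b_{t+1}(s') \;\propto\; O(o_{t+1} \mid s', a_t) \sum_{s} T(s' \mid s, a_t)\, b_t(s),
\]

where \(T\) denotes the transition function and \(O\) the observation function. Under the view proposed here, the hidden state includes the identity of the current task, and the claim is that the RNN's activity dynamics approximate this belief update, which is what allows the agent to adapt quickly to new tasks drawn from the training distribution.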