Agents that interact with other agents often do not know a priori what the other agents' strategies are, but must maximise their own online return while interacting with and learning about others. The optimal adaptive behaviour under uncertainty over the other agents' strategies with respect to some prior can in principle be computed using the Interactive Bayesian Reinforcement Learning framework. Unfortunately, doing so is intractable in most settings, and existing approximation methods are restricted to small tasks. To overcome this, we propose to meta-learn approximate belief inference and Bayes-optimal behaviour for a given prior. To model beliefs over other agents, we combine sequential and hierarchical Variational Auto-Encoders, and meta-train this inference model alongside the policy. We show empirically that our approach outperforms existing methods that use a model-free approach, sample from the approximate posterior, maintain memory-free models of others, or do not fully utilise the known structure of the environment.
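To make the "sequential inference model" idea concrete, the following is a minimal numpy-only sketch of a recurrent variational encoder that folds an interaction history of (observation, action, reward) tuples into a running hidden state and emits Gaussian posterior parameters over a latent describing the other agent at every timestep. All names, dimensions, and the single-level latent are illustrative assumptions, not the paper's actual architecture (which is hierarchical and meta-trained end-to-end alongside the policy with deep-learning tooling).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not from the paper).
OBS, ACT, HID, LATENT = 4, 2, 8, 3

# Randomly initialised encoder weights; in the actual method these
# would be meta-trained jointly with the policy over the given prior.
params = {
    "W_in":     rng.normal(size=(OBS + ACT + 1, HID)) * 0.1,  # input -> hidden
    "W_rec":    rng.normal(size=(HID, HID)) * 0.1,            # hidden -> hidden
    "b":        np.zeros(HID),
    "W_mu":     rng.normal(size=(HID, LATENT)) * 0.1,         # hidden -> posterior mean
    "W_logvar": rng.normal(size=(HID, LATENT)) * 0.1,         # hidden -> posterior log-variance
}

def encode_history(history, params):
    """Sequential encoder: consume (obs, action, reward) tuples one step
    at a time and output Gaussian belief parameters (mu, logvar) over a
    latent characterising the other agent, at every timestep."""
    h = np.zeros(HID)
    posteriors = []
    for obs, act, rew in history:
        x = np.concatenate([obs, act, [rew]])
        # Plain recurrent update; a GRU/LSTM would be used in practice.
        h = np.tanh(x @ params["W_in"] + h @ params["W_rec"] + params["b"])
        mu = h @ params["W_mu"]
        logvar = h @ params["W_logvar"]
        posteriors.append((mu, logvar))
    return posteriors

# Fake 5-step interaction history with a hypothetical other agent.
history = [(rng.normal(size=OBS), rng.normal(size=ACT), float(rng.normal()))
           for _ in range(5)]
posteriors = encode_history(history, params)
mu_T, logvar_T = posteriors[-1]
```

The policy would then condition on the current observation together with the belief parameters `(mu_T, logvar_T)` rather than on a single posterior sample, which is one way to approximate acting Bayes-optimally under uncertainty about the other agent.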