The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and rely on multi-step counterfactual reasoning based on assumptions about other agents' actions, and thus fail when paired with humans or independently trained agents. Yet no current methods can learn optimal policies that are fully grounded, i.e., that do not rely on counterfactual information inferred from observing other agents' actions. To address this, we present off-belief learning (OBL): at each time step, OBL agents assume that all past actions were taken by a given, fixed policy ($\pi_0$), but that future actions will be taken by an optimal policy under these same assumptions. When $\pi_0$ is uniform random, OBL learns the optimal grounded policy. OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next. This introduces counterfactual reasoning in a controlled manner. Unlike independent RL, which may converge to any equilibrium policy, OBL converges to a unique policy, making it more suitable for zero-shot coordination. OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a simple toy setting and Hanabi, the benchmark problem for human-AI and zero-shot coordination.
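As an illustrative sketch of this objective (the notation here is assumed for exposition rather than taken verbatim from the abstract): write $\tau_i^t$ for agent $i$'s action-observation history at time $t$, and $\mathcal{B}_{\pi_0}(\cdot \mid \tau_i^t)$ for the belief over full histories obtained by assuming all past actions were drawn from the fixed policy $\pi_0$. OBL then optimizes
$$
V_{\pi_0 \to \pi_1}(\tau_i^t) \;=\; \mathbb{E}_{\tau^t \sim \mathcal{B}_{\pi_0}(\cdot \mid \tau_i^t)}\;\mathbb{E}_{\pi_1}\!\Big[\textstyle\sum_{t' \ge t} \gamma^{\,t'-t}\, r^{t'} \;\Big|\; \tau^t\Big],
$$
i.e., past play is interpreted through $\pi_0$, while future play is assumed to follow the learned policy $\pi_1$ itself, which is trained to maximize this value.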