The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions, and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep, OBL agents follow a policy $\pi_1$ that is optimized assuming that past actions were taken by a given, fixed policy $\pi_0$ but that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal grounded policy, that is, an optimal policy that does not rely on inferences based on other agents' behavior. OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy setting and Hanabi, the benchmark problem for human-AI coordination and ZSC.
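
The split between past and future behavior can be written as a single value function. The notation below is an assumption introduced for illustration and does not appear in the text above: $\tau^i_t$ denotes agent $i$'s action-observation history at time $t$, $\tau_t$ a full trajectory consistent with it, $\mathcal{B}_{\pi_0}(\cdot \mid \tau^i_t)$ the belief over such trajectories computed as if every past action had been drawn from $\pi_0$, and $V^{\pi_1}(\tau_t)$ the expected return from $\tau_t$ when all agents follow $\pi_1$ from then on. Under these assumptions, a sketch of the quantity that $\pi_1$ is optimized against is:

```latex
% Sketch only: symbols are illustrative assumptions, not defined in the abstract.
V^{\pi_0 \to \pi_1}(\tau^i_t)
  = \mathbb{E}_{\tau_t \sim \mathcal{B}_{\pi_0}(\cdot \mid \tau^i_t)}
    \left[ V^{\pi_1}(\tau_t) \right]
```

Read this way, the belief depends only on $\pi_0$, so when $\pi_0$ is uniform random the past actions of other agents carry no information beyond their direct effect on the observed state, which is what makes the resulting policy grounded.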