Opponent modeling improves a controlled agent's decision-making by constructing models of the other agents it interacts with. Existing methods commonly assume access to opponents' observations and actions, which is infeasible when opponents' behaviors are unobservable or their data are hard to obtain. We propose a novel multi-agent distributional actor-critic algorithm that achieves opponent modeling with purely local information (i.e., the controlled agent's own observations, actions, and rewards). Specifically, the actor maintains a speculated belief about the opponents, which we call the \textit{imaginary opponent models}, to predict opponents' actions from local observations and make decisions accordingly. Further, the distributional critic models the return distribution of the policy; since this distribution reflects the quality of the actor, it can guide the training of the imaginary opponent models on which the actor relies. Extensive experiments confirm that our method successfully models opponents' behaviors without access to their data and outperforms baseline methods with faster convergence.
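To make the described architecture concrete, below is a minimal PyTorch sketch of the three components the abstract names. The module names (`ImaginaryOpponentModel`, `Actor`, `DistributionalCritic`), the quantile parameterization of the critic, and all dimensions are illustrative assumptions, not the paper's exact design.

```python
# A minimal sketch, assuming PyTorch; everything below is an illustrative
# reading of the abstract, not the authors' reference implementation.
import torch
import torch.nn as nn


class ImaginaryOpponentModel(nn.Module):
    """Speculated belief over opponents: predicts their action distributions
    from the controlled agent's purely local observation."""

    def __init__(self, obs_dim, opp_action_dim, n_opponents, hidden=64):
        super().__init__()
        self.n_opponents = n_opponents
        self.opp_action_dim = opp_action_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_opponents * opp_action_dim),
        )

    def forward(self, obs):
        logits = self.net(obs).view(-1, self.n_opponents, self.opp_action_dim)
        return torch.softmax(logits, dim=-1)  # imagined opponent policies


class Actor(nn.Module):
    """Conditions the controlled agent's policy on the imagined opponent actions."""

    def __init__(self, obs_dim, action_dim, opp_model, hidden=64):
        super().__init__()
        self.opp_model = opp_model
        in_dim = obs_dim + opp_model.n_opponents * opp_model.opp_action_dim
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, obs):
        imagined = self.opp_model(obs).flatten(1)  # belief over opponents
        return torch.softmax(self.net(torch.cat([obs, imagined], dim=-1)), dim=-1)


class DistributionalCritic(nn.Module):
    """Distributional critic; here a quantile head approximating the return
    distribution Z(s) rather than a scalar value V(s)."""

    def __init__(self, obs_dim, n_quantiles=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_quantiles),
        )

    def forward(self, obs):
        return self.net(obs)  # quantile estimates of the return distribution


# Illustrative usage: gradients from the critic's assessment of the policy flow
# back through the actor into the imaginary opponent model, so the opponent
# model is trained without ever seeing the opponents' observations or actions.
obs = torch.randn(8, 10)                           # batch of local observations
opp_model = ImaginaryOpponentModel(10, 4, n_opponents=2)
actor = Actor(10, 5, opp_model)
critic = DistributionalCritic(10)
pi = actor(obs)                                    # policy conditioned on imagined opponents
z = critic(obs)                                    # quantile estimates of the return
```

One point this sketch is meant to surface: the opponent-model parameters receive learning signal only through the critic's evaluation of the actor's policy, which is how training can proceed from purely local information.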