Search is an important tool for computing effective policies in single- and multi-agent environments, and has been crucial for achieving superhuman performance in several benchmark fully and partially observable games. However, one major limitation of prior search approaches for partially observable environments is that the computational cost scales poorly with the amount of hidden information. In this paper we present \emph{Learned Belief Search} (LBS), a computationally efficient search procedure for partially observable environments. Rather than maintaining an exact belief distribution, LBS uses an approximate auto-regressive counterfactual belief that is learned as a supervised task. In multi-agent settings, LBS uses a novel public-private model architecture for the underlying policies in order to efficiently evaluate these policies during rollouts. In the benchmark domain of Hanabi, LBS obtains $55\% \sim 91\%$ of the benefit of exact search while reducing compute requirements by $35.8\times \sim 4.6\times$, allowing it to scale to larger settings that were inaccessible to previous search methods.
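To make the core idea concrete, the sketch below shows one plausible way to realize an auto-regressive belief learned as a supervised task: a network that, given features of an agent's observable history, predicts the agent's own hidden hand one card slot at a time, conditioning each slot on the cards sampled for earlier slots, and is trained with cross-entropy against hands observed in self-play. This is an illustrative assumption, not the paper's exact architecture; all names and dimensions (\texttt{OBS\_DIM}, \texttt{NUM\_CARD\_TYPES}, \texttt{HAND\_SIZE}, \texttt{HIDDEN}) are hypothetical.

\begin{verbatim}
# Minimal sketch of an auto-regressive belief model trained as a supervised
# task (illustrative, not the paper's architecture; all sizes are made up).
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, NUM_CARD_TYPES, HAND_SIZE, HIDDEN = 512, 25, 5, 256

class AutoRegressiveBelief(nn.Module):
    def __init__(self):
        super().__init__()
        self.obs_enc = nn.Linear(OBS_DIM, HIDDEN)
        # extra embedding index serves as a "start of hand" token
        self.card_emb = nn.Embedding(NUM_CARD_TYPES + 1, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, NUM_CARD_TYPES)

    def forward(self, obs, hand):
        # obs:  [B, OBS_DIM] features of the observable history
        # hand: [B, HAND_SIZE] ground-truth card ids (teacher forcing)
        start = torch.full((hand.size(0), 1), NUM_CARD_TYPES,
                           dtype=torch.long, device=hand.device)
        inputs = self.card_emb(torch.cat([start, hand[:, :-1]], dim=1))
        h0 = torch.tanh(self.obs_enc(obs)).unsqueeze(0)
        out, _ = self.rnn(inputs, h0)
        return self.head(out)  # [B, HAND_SIZE, NUM_CARD_TYPES] slot logits

    @torch.no_grad()
    def sample(self, obs):
        # Draw one hand from the learned belief, slot by slot.
        B = obs.size(0)
        h = torch.tanh(self.obs_enc(obs)).unsqueeze(0)
        prev = torch.full((B, 1), NUM_CARD_TYPES,
                          dtype=torch.long, device=obs.device)
        cards = []
        for _ in range(HAND_SIZE):
            out, h = self.rnn(self.card_emb(prev), h)
            logits = self.head(out[:, -1])
            prev = torch.distributions.Categorical(logits=logits).sample().unsqueeze(1)
            cards.append(prev)
        return torch.cat(cards, dim=1)

def train_step(model, optimizer, obs, hand):
    # Supervised objective: cross-entropy against hands seen in self-play.
    logits = model(obs, hand)
    loss = F.cross_entropy(logits.reshape(-1, NUM_CARD_TYPES), hand.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
\end{verbatim}

At search time, such a model would replace the exact belief update: rollouts draw sampled hands from \texttt{sample} instead of enumerating the full belief, which is what decouples the cost of search from the amount of hidden information.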