We present the Adaptive Entropy Tree Search (ANTS) algorithm, a planning method based on the Principle of Maximum Entropy. We design ANTS to be a practical component of a planning-learning loop, and it outperforms state-of-the-art methods on the Atari benchmark. The key algorithmic novelty is entropy parameterization, which mitigates sensitivity to the temperature parameter, a bottleneck of prior maximum entropy planning methods. To confirm our design choices, we perform a comprehensive suite of ablations in isolation from learning. Moreover, we show theoretically that ANTS enjoys exponential convergence in the softmax bandit setting.
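To illustrate why parameterizing by entropy can reduce temperature sensitivity, the following is a minimal sketch, not the ANTS procedure itself. It assumes that "entropy parameterization" can be read as fixing a target entropy for a softmax policy and solving for the temperature that attains it. The function names, the bisection bounds, and the example Q-values are all hypothetical.

```python
import math

def softmax(q_values, temperature):
    # Numerically stable softmax over Q-values at the given temperature.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def temperature_for_entropy(q_values, target_entropy, lo=1e-3, hi=1e3, iters=60):
    # Hypothetical helper: bisect in log space for the temperature whose
    # softmax policy has the requested entropy. This relies on the entropy
    # of softmax(q / t) being non-decreasing in t, and on the target lying
    # between the entropies reached at the lo and hi bounds.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if entropy(softmax(q_values, mid)) < target_entropy:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Illustrative Q-values (not from the paper) at two different reward scales.
q_small = [1.0, 2.0, 3.0]
q_large = [10.0 * q for q in q_small]
target = 0.8  # must be below log(3), the maximum entropy for three actions

tau_small = temperature_for_entropy(q_small, target)
tau_large = temperature_for_entropy(q_large, target)
policy_small = softmax(q_small, tau_small)
policy_large = softmax(q_large, tau_large)
```

In this sketch the same entropy target yields the same policy at both reward scales, while the temperature that achieves it scales with the Q-values. A fixed temperature would instead give different policies for `q_small` and `q_large`, which is one way to see the sensitivity described above.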