Tree Search (TS) is crucial to some of the most influential successes in reinforcement learning. Here, we tackle two major challenges with TS that limit its usability: \textit{distribution shift} and \textit{scalability}. We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent, even when given access to the exact states and rewards of future steps. We show this is due to a distribution shift to areas where value estimates are highly inaccurate, and analyze this effect using Extreme Value theory. To overcome this problem, we introduce a novel off-policy correction term that accounts for the mismatch between the pre-trained value and its corresponding TS policy by penalizing under-sampled trajectories. We prove that our correction eliminates the above mismatch and bound the probability of sub-optimal action selection. Our correction significantly improves pre-trained Rainbow agents without any further training, often more than doubling their scores on Atari games. Next, we address the scalability issue arising from the computational complexity of exhaustive TS, which scales exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first search that advances all nodes at each depth of the tree simultaneously. Batch-BFS reduces runtime by two orders of magnitude and, beyond inference, also enables training with TS at depths that were previously infeasible. We train DQN agents from scratch using TS and show improvement on several Atari games compared to both the original DQN and the more advanced Rainbow.
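The following is a minimal sketch of the batched breadth-first expansion idea described above: every frontier node at a given depth is expanded by every action in a single batched call, and leaf values are bootstrapped with a pre-trained Q-network. The `model(states, actions)` and `q_net(states)` interfaces, as well as all names below, are illustrative assumptions rather than the paper's actual implementation, and the off-policy correction term is omitted.

\begin{verbatim}
import torch

@torch.no_grad()
def batch_bfs_action(q_net, model, root_state, num_actions, depth, gamma=0.99):
    """Sketch of batched BFS tree search (hypothetical interfaces).

    Assumes model(states, actions) -> (next_states, rewards) for a batch,
    q_net(states) -> Q-values of shape [batch, num_actions], and depth >= 1.
    """
    device = root_state.device
    frontier = root_state.unsqueeze(0)           # current layer of states, [1, ...]
    returns = torch.zeros(1, device=device)      # discounted reward along each path
    root_action = None                           # root action that led to each path

    for d in range(depth):
        b = frontier.shape[0]
        # Expand every frontier node by every action in one batched call.
        actions = torch.arange(num_actions, device=device).repeat(b)      # [b*A]
        states = frontier.repeat_interleave(num_actions, dim=0)           # [b*A, ...]
        next_states, rewards = model(states, actions)
        returns = returns.repeat_interleave(num_actions) + (gamma ** d) * rewards
        root_action = actions.clone() if d == 0 else root_action.repeat_interleave(num_actions)
        frontier = next_states

    # Bootstrap with the pre-trained Q-function at the leaves and back up the max.
    leaf_values = q_net(frontier).max(dim=1).values
    total = returns + (gamma ** depth) * leaf_values
    return root_action[total.argmax()].item()
\end{verbatim}

Because each depth is processed as one GPU batch instead of node-by-node, the number of forward passes grows linearly with the depth even though the number of tree nodes grows exponentially, which is what makes deeper exhaustive search tractable in this sketch.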