Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.