Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on in-practice game outcomes. While valuable, such evaluation typically fails to measure robustness to worst-case outcomes. Prior research on computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible in larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.
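To make the notion of worst-case performance concrete, the sketch below computes the exact best-response value against a fixed policy in a tiny zero-sum matrix game; this is the quantity that ISMCTS-BR approximates in games too large for exact computation. The game, policy, and payoff values here are illustrative assumptions, not taken from the paper's experiments.

```python
# Minimal illustration: exact best-response (worst-case) value against a fixed
# policy in a 2x2 zero-sum matrix game (matching pennies).
import numpy as np

# Payoff matrix for the row player: rows index the row player's actions,
# columns index the column player's actions.
payoffs = np.array([[1.0, -1.0],
                    [-1.0, 1.0]])

# A fixed, hypothetical policy for the row player (e.g. produced by a learned
# agent). It plays its first action 60% of the time, so it is exploitable.
row_policy = np.array([0.6, 0.4])

# Expected payoff to the column player (the exploiter) for each pure response.
# In a zero-sum game the column player's payoff is the negation of the row
# player's payoff.
column_values = -(row_policy @ payoffs)

# The best response is the column action with the highest expected value; its
# value is the row agent's worst-case loss against an adversarial opponent.
best_response_value = column_values.max()
print(f"Best-response value against the fixed policy: {best_response_value:+.2f}")
# Output: +0.20. A uniform (equilibrium) policy would instead yield 0.00;
# the gap is the exploitability that a learned best response estimates.
```

In large imperfect-information games this argmax over opponent strategies cannot be enumerated, which is why the paper trains a best-responding agent with search-based deep reinforcement learning rather than computing it exactly.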