Computing equilibria by minimizing exploitability with best-response ensembles

In this paper, we study the problem of computing an approximate Nash equilibrium of a continuous game. Such games naturally model many situations involving space, time, money, and other fine-grained resources or quantities. The standard measure of how close a strategy profile is to a Nash equilibrium is exploitability, which measures how much utility players can gain by unilaterally changing their strategies. We introduce a new equilibrium-finding method that minimizes an approximation of the exploitability. This approximation employs a best-response ensemble for each player, which maintains multiple candidate best responses for that player. In each iteration, the best-performing element of each ensemble is used in a gradient-based scheme to update the current strategy profile. The strategy profile and the best-response ensembles are trained simultaneously, the former to minimize the approximate exploitability and the latter to maximize it. Experiments on a suite of benchmark games show that the method outperforms previous methods.
