We study the problem of computing an approximate Nash equilibrium of a continuous-action game without access to gradients. This kind of black-box access to the game is common in reinforcement learning settings, where the environment is typically treated as a black box. To tackle this problem, we apply zeroth-order optimization techniques that combine smoothed gradient estimators with equilibrium-finding dynamics. We model players' strategies using artificial neural networks. In particular, we use randomized policy networks to model mixed strategies. These networks take noise as input in addition to an observation, and can flexibly represent arbitrary observation-dependent, continuous-action distributions. Being able to model such mixed strategies is crucial for tackling continuous-action games that lack pure-strategy equilibria. We evaluate the performance of our method using an approximation of the Nash convergence metric from game theory, which measures how much players can gain by unilaterally changing their strategy. We apply our method to continuous Colonel Blotto games, single-item and multi-item auctions, and a visibility game. The experiments show that our method can quickly find high-quality approximate equilibria. Furthermore, they show that the dimensionality of the input noise is crucial for performance. To our knowledge, this paper is the first to solve general continuous-action games with unrestricted mixed strategies and without any gradient information.
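As a rough illustration of how a smoothed gradient estimator can be combined with equilibrium-finding dynamics, the following minimal sketch may help. It is not the method of the paper. The two-player zero-sum payoff `u1`, the scalar pure strategies (in place of randomized policy networks), the two-point Gaussian-smoothing estimator, the simultaneous gradient ascent update, the grid-based deviation search, and all hyperparameters are assumptions chosen only to keep the example small and self-contained.

```python
import random

def u1(x, y):
    # Hypothetical toy payoff for player 1 (concave in x); not taken from the paper.
    return -x * x + y * y + x * y

def u2(x, y):
    # Zero-sum game: player 2 receives the negative of player 1's payoff (concave in y).
    return -u1(x, y)

def zeroth_order_grad(f, theta, sigma, num_samples, rng):
    """Two-point Gaussian-smoothing estimate of df/dtheta.

    Uses only evaluations of f (black-box access), never its gradient.
    """
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)  # random perturbation direction
        total += (f(theta + sigma * z) - f(theta - sigma * z)) / (2.0 * sigma) * z
    return total / num_samples

def simultaneous_ascent(x, y, steps=500, lr=0.05, sigma=0.1, num_samples=8, seed=0):
    """Each player ascends its own utility using estimated gradients."""
    rng = random.Random(seed)
    for _ in range(steps):
        gx = zeroth_order_grad(lambda a: u1(a, y), x, sigma, num_samples, rng)
        gy = zeroth_order_grad(lambda b: u2(x, b), y, sigma, num_samples, rng)
        x, y = x + lr * gx, y + lr * gy  # simultaneous update of both players
    return x, y

def approx_nash_conv(x, y, lo=-2.0, hi=2.0, grid=401):
    """Sum of each player's best unilateral gain, searched over a grid of deviations."""
    candidates = [lo + (hi - lo) * i / (grid - 1) for i in range(grid)]
    gain1 = max(u1(c, y) for c in candidates) - u1(x, y)
    gain2 = max(u2(x, c) for c in candidates) - u2(x, y)
    return gain1 + gain2

x_star, y_star = simultaneous_ascent(1.0, -1.0)
print(x_star, y_star, approx_nash_conv(x_star, y_star))
```

In this toy game the unique equilibrium is at the origin, so the approximate Nash convergence value should shrink toward zero as the dynamics proceed. A game without a pure-strategy equilibrium would instead require a mixed strategy, for instance one that maps input noise to actions as the abstract describes.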