使用碰撞通信的多武装多武装强盗 (On No-Sensing Adversarial Multi-player Multi-armed Bandits with Collision Communications)

from arxiv, 19 pages, 8 figures. Accepted to IEEE Journal on Selected Areas in Information Theory, Special Issue on Sequential, Active, and Reinforcement Learning

We study the notoriously difficult no-sensing adversarial multi-player multi-armed bandits (MP-MAB) problem from a new perspective. Instead of focusing on the hardness of multiple players, we introduce a new dimension of hardness, called attackability. All adversaries can be categorized based on the attackability and we introduce Adversary-Adaptive Collision-Communication (A2C2), a family of algorithms with forced-collision communication among players. Both attackability-aware and unaware settings are studied, and information-theoretic tools of the Z-channel model and error-correction coding are utilized to address the challenge of implicit communication without collision information in an adversarial environment. For the more challenging attackability-unaware problem, we propose a simple method to estimate the attackability enabled by a novel error-detection repetition code and randomized communication for synchronization. Theoretical analysis proves that asymptotic attackability-dependent sublinear regret can be achieved, with or without knowing the attackability. In particular, the asymptotic regret does not have an exponential dependence on the number of players, revealing a fundamental tradeoff between the two dimensions of hardness in this problem.

翻译：我们从新的角度来研究众所周知的困难的不遥感对抗性多玩家多武装匪徒(MP-MAB)问题。我们不注重多个玩家的强硬性,而是引入了一个新的硬性层面,称为攻击性。所有对手都可以根据攻击性进行分类,我们引入了反偏向-自动碰撞-通信(A2C2),一套在玩家之间进行强迫协作交流的算法。研究了攻击性觉悟和不知情的环境,并使用了Z频道模型和错误校正编码的信息理论工具来应对隐性通信的挑战,而没有在对抗性环境中的碰撞信息。关于更具挑战性的攻击性-无源问题,我们提出了一个简单的方法来估计攻击性,因为新的错误-探测重复代码和随机通信可以进行同步。理论分析证明,可以实现依赖攻击性的亚线性亚线性遗憾,无论是否知道攻击性。特别是,在攻击性遗憾对玩家数目的难度方面,都不存在指数性依赖。