The multi-armed bandit (MAB) problem is an active learning framework that aims to select the best among a set of actions by sequentially observing rewards. Recently, it has become popular for a number of applications over wireless networks, where communication constraints can form a bottleneck. Existing works usually fail to address this issue and can become infeasible in certain applications. In this paper we address the communication problem by optimizing how the rewards collected by distributed agents are communicated. By providing nearly matching upper and lower bounds, we tightly characterize the number of bits needed per reward for the learner to learn accurately without suffering additional regret. In particular, we establish a generic reward quantization algorithm, QuBan, that can be applied on top of any (no-regret) MAB algorithm to form a new communication-efficient counterpart that requires only a few (as few as 3) bits to be sent per iteration while preserving the same regret bound. Our lower bound is established by constructing hard instances from a subgaussian distribution. Our theory is further corroborated by numerical experiments.
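To make the setting concrete, the sketch below illustrates the general idea of running a standard bandit algorithm on rewards that are compressed to a few bits before being sent to the learner. It is not the paper's QuBan scheme; it assumes a bounded reward range, a hypothetical 3-bit unbiased stochastic quantizer, and UCB1 as the base algorithm.

```python
import numpy as np

def quantize_reward(r, num_bits=3, lo=-1.0, hi=1.0):
    """Illustrative (not QuBan) unbiased stochastic quantization of a clipped
    reward to 2**num_bits levels, so only num_bits are sent per round."""
    levels = 2 ** num_bits - 1
    r = np.clip(r, lo, hi)
    scaled = (r - lo) / (hi - lo) * levels        # map to [0, levels]
    low = np.floor(scaled)
    p_up = scaled - low                           # randomize rounding so E[q] = r (within the clip range)
    q = low + (np.random.rand() < p_up)
    return lo + q / levels * (hi - lo)

def ucb_with_quantized_rewards(means, T=5000, num_bits=3, noise_std=0.1):
    """UCB1 run on quantized rewards; returns the cumulative regret."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    best = max(means)
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                           # play each arm once to initialize
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = means[arm] + noise_std * np.random.randn()  # subgaussian reward
        q = quantize_reward(reward, num_bits)     # few-bit message from agent to learner
        counts[arm] += 1
        sums[arm] += q
        regret += best - means[arm]
    return regret

if __name__ == "__main__":
    np.random.seed(0)
    print("cumulative regret:", ucb_with_quantized_rewards([0.2, 0.5, 0.8]))
```

In this toy version the learner's statistics are built from the quantized rewards only, so each round costs `num_bits` bits of uplink communication; the paper's contribution is a quantizer and matching lower bound showing how few bits suffice without degrading the regret guarantee.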