Differentially Private Linear Bandits with Partial Distributed Feedback

In this paper, we study the problem of global reward maximization with only partial distributed feedback. This problem is motivated by several real-world applications (e.g., cellular network configuration, dynamic pricing, and policy selection) where an action taken by a central entity influences a large population that contributes to the global reward. However, collecting such reward feedback from the entire population not only incurs a prohibitively high cost but also often raises privacy concerns. To tackle this problem, we consider differentially private distributed linear bandits, where only a subset of users from the population (called clients) is selected to participate in the learning process, and the central server learns the global model from this partial feedback by iteratively aggregating the clients' local feedback in a differentially private fashion. We then propose a unified algorithmic learning framework, called differentially private distributed phased elimination (DP-DPE), which can be naturally integrated with popular differential privacy (DP) models (including central DP, local DP, and shuffle DP). Furthermore, we prove that DP-DPE achieves both sublinear regret and sublinear communication cost. Interestingly, DP-DPE also achieves privacy protection "for free" in the sense that the additional cost due to privacy guarantees is a lower-order additive term. In addition, as a by-product of our techniques, the same "free" privacy result also holds for standard differentially private linear bandits. Finally, we conduct simulations to corroborate our theoretical results and demonstrate the effectiveness of DP-DPE.
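To make the private aggregation step more concrete, here is a minimal Python sketch of a central-DP average of client rewards using the classical Gaussian mechanism. It is not the DP-DPE algorithm itself: the function name `private_mean`, the clipping bound `clip`, and the choice of the Gaussian mechanism are assumptions made for illustration only, and the abstract does not specify any of them.

```python
import math
import random


def private_mean(rewards, clip, epsilon, delta, rng=None):
    """Return a noisy mean of per-client rewards and the noise scale used.

    Hypothetical helper for illustration; assumes each client contributes
    one scalar reward and that replacing one client's reward is the
    neighboring relation.
    """
    if not rewards:
        raise ValueError("need at least one client reward")
    if not (0 < epsilon < 1 and 0 < delta < 1):
        # The classical Gaussian-mechanism calibration assumes epsilon < 1.
        raise ValueError("requires 0 < epsilon < 1 and 0 < delta < 1")
    rng = rng or random.Random()
    n = len(rewards)
    # Clip each reward to [-clip, clip] so one client has bounded influence.
    clipped = [max(-clip, min(clip, r)) for r in rewards]
    # Replacing one clipped reward changes the mean by at most 2 * clip / n.
    sensitivity = 2.0 * clip / n
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return sum(clipped) / n + rng.gauss(0.0, sigma), sigma


if __name__ == "__main__":
    rewards = [0.5] * 1000  # made-up client feedback
    estimate, sigma = private_mean(rewards, clip=1.0, epsilon=0.5, delta=1e-5)
    print(f"noisy mean = {estimate:.4f}, noise std = {sigma:.4f}")
```

In this sketch the noise standard deviation shrinks like 1/n in the number of participating clients, while the sampling error of an average of n bounded rewards shrinks like 1/sqrt(n). That gives one rough intuition for how a privacy cost can become a lower-order term, although the paper's actual analysis may proceed differently.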
