In this paper, we study the problem of stochastic multi-armed bandits (MAB) in the (local) differential privacy (DP/LDP) model. Unlike previous results, which need to assume bounded reward distributions, here we mainly focus on the case where the reward distribution of each arm only has a finite $(1+v)$-th moment for some $v\in (0, 1]$. In the first part, we study the problem in the central $\epsilon$-DP model. We first provide a near-optimal result by developing a private and robust Upper Confidence Bound (UCB) algorithm. Then, we improve the result via a private and robust version of the Successive Elimination (SE) algorithm. Finally, we show that the instance-dependent regret bound of our improved algorithm is optimal by proving a matching lower bound. In the second part of the paper, we study the problem in the $\epsilon$-LDP model. We propose an algorithm that can be seen as a locally private and robust version of the SE algorithm, and show that it achieves (near-)optimal rates for both the instance-dependent and instance-independent regret. All of the above results also reveal the differences between private MAB with bounded rewards and with heavy-tailed rewards. To achieve these (near-)optimal rates, we develop several new hard instances and private robust estimators as byproducts, which may be applicable to other related problems. Finally, experimental results support our theoretical analysis and demonstrate the effectiveness of our algorithms.
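To give a flavor of the private robust estimators involved, the following is a minimal sketch of an $\epsilon$-DP mean estimator for heavy-tailed rewards: each reward is truncated at a threshold $B$ (the robust step) and the clipped empirical mean is then privatized with Laplace noise calibrated to its sensitivity (the DP step). The threshold $B$ and the specific truncate-then-Laplace mechanism here are illustrative assumptions, not the exact estimator developed in the paper.

```python
import numpy as np

def private_truncated_mean(rewards, eps, B):
    """Sketch of an eps-DP robust mean estimator for heavy-tailed rewards.

    Assumptions (illustrative, not the paper's exact construction):
    - rewards are clipped to [-B, B], which controls heavy tails;
    - the clipped mean has sensitivity 2B/n, so Laplace noise with
      scale 2B/(n*eps) yields eps-differential privacy.
    """
    n = len(rewards)
    clipped = np.clip(rewards, -B, B)       # truncation handles heavy tails
    sensitivity = 2.0 * B / n               # sensitivity of the clipped mean
    noise = np.random.laplace(scale=sensitivity / eps)
    return clipped.mean() + noise

# Example usage: rewards drawn from a heavy-tailed (Pareto) distribution.
rewards = np.random.pareto(a=1.5, size=1000)
estimate = private_truncated_mean(rewards, eps=1.0, B=10.0)
```

In a UCB- or SE-style bandit algorithm, such an estimator would replace the empirical mean of each arm, with the confidence radius enlarged to account for both the truncation bias and the injected Laplace noise.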