We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP). First, we prove minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with $\epsilon$-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large $\epsilon$), bandits with $\epsilon$-global DP are no harder than bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal $\epsilon$-global DP extension of an index-based optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and usage of only the rewards collected in the last episode for computing private statistics. Specifically, we instantiate $\epsilon$-global DP extensions of the UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB. AdaP-KLUCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.
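The first ingredient of the framework, the Laplace mechanism applied to the rewards of the last episode, can be sketched as follows. This is a minimal illustration under the standard assumption of rewards bounded in $[0, 1]$; the function name and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def private_mean_laplace(rewards, epsilon, reward_range=1.0, rng=None):
    """Privatize the empirical mean of bounded rewards with the Laplace
    mechanism. The mean of n rewards in [0, reward_range] has sensitivity
    reward_range / n, so Laplace noise of scale reward_range / (n * epsilon)
    yields epsilon-DP for this statistic.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(rewards)
    scale = reward_range / (n * epsilon)
    return float(np.mean(rewards)) + rng.laplace(0.0, scale)

# Illustrative use: rewards of one arm collected in its last episode.
rng = np.random.default_rng(0)
rewards = rng.uniform(0, 1, size=1000)
noisy_mean = private_mean_laplace(rewards, epsilon=1.0, rng=rng)
```

Using only the last episode's rewards keeps each observation in at most one private statistic, so the per-episode privacy budget is not split across overlapping computations.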