Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common for user data to contain sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees, which shows that guaranteeing privacy has a multiplicative effect on the regret. This result shows that while LDP is an appealing notion of privacy, it makes the learning problem significantly more complex. Finally, we present an optimistic algorithm that simultaneously satisfies $\varepsilon$-LDP requirements and achieves $\sqrt{K}/\varepsilon$ regret in any finite-horizon MDP after $K$ episodes, matching the lower bound's dependency on the number of episodes $K$.
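The paper's algorithm is not reproduced in this abstract; as a minimal illustration of the user-side obfuscation that $\varepsilon$-LDP requires, the sketch below applies the standard Laplace mechanism to a bounded per-user statistic before it is sent to the learner. The helper name `laplace_ldp` and the concrete values are illustrative assumptions, not part of the paper.

```python
import numpy as np

def laplace_ldp(value, epsilon, sensitivity=1.0, rng=None):
    """Privatize one user-side statistic with the Laplace mechanism.

    Adding Laplace(sensitivity / epsilon) noise to a statistic whose
    value can change by at most `sensitivity` across a user's possible
    inputs yields an epsilon-LDP report of that statistic.
    """
    rng = rng if rng is not None else np.random.default_rng()
    return value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Each user perturbs data locally; the learner only ever sees the
# noisy reports, never the raw trajectories (illustrative values).
epsilon = 1.0
raw_rewards = [0.3, 0.7, 0.5]
reports = [laplace_ldp(r, epsilon) for r in raw_rewards]
```

Because the noise scale grows as $1/\varepsilon$, stronger privacy (smaller $\varepsilon$) means noisier reports, which is consistent with the $\sqrt{K}/\varepsilon$ regret scaling stated above.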