Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$, which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon > S^{1.5}A^{0.5}H^2/\sqrt{T}$. In the above, $S$ and $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as $T\rightarrow \infty$. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
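One way to read the $\epsilon > S^{1.5}A^{0.5}H^2/\sqrt{T}$ threshold: the additive privacy cost in the regret bound is dominated by the leading non-private term exactly when
\[
\frac{S^2AH^3}{\epsilon} \le \sqrt{SAH^2T}
\quad\Longleftrightarrow\quad
\epsilon \ge \frac{S^2AH^3}{\sqrt{SAH^2T}} = \frac{S^{1.5}A^{0.5}H^2}{\sqrt{T}},
\]
so for any fixed $\epsilon$ the condition eventually holds as $T\rightarrow\infty$, and the private regret matches the non-private lower bound up to logarithmic factors; this is the sense in which privacy is asymptotically free.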