We study regret minimization in finite-horizon tabular Markov decision processes (MDPs) under the constraints of differential privacy (DP). This is motivated by the widespread application of reinforcement learning (RL) to real-world sequential decision-making problems, where protecting users' sensitive and private information is becoming paramount. We consider two variants of DP -- joint DP (JDP), where a centralized agent is responsible for protecting users' sensitive data, and local DP (LDP), where information must be protected directly on the user side. We first propose two general frameworks -- one for policy optimization and another for value iteration -- for designing private, optimistic RL algorithms. We then instantiate these frameworks with suitable privacy mechanisms to satisfy the JDP and LDP requirements, and simultaneously obtain sublinear regret guarantees. The regret bounds show that under JDP, the cost of privacy is only a lower-order additive term, while under the stronger protection of LDP, the cost incurred is multiplicative. Finally, the regret bounds are obtained via a unified analysis, which, we believe, can be extended beyond tabular MDPs.
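To make the distinction between the two privacy regimes concrete, the display below is only a schematic sketch of how an additive versus a multiplicative privacy cost typically enters the regret; the symbols $T$ (total steps), $S$, $A$, $H$ (states, actions, horizon), and $\varepsilon$ (privacy budget), as well as the hidden polynomial factors, are assumptions for illustration and not the paper's exact bounds.

```latex
% Illustrative shapes only (assumed, not the paper's stated results):
% T = total number of steps, \varepsilon = privacy budget,
% poly(S,A,H) hides state/action/horizon-dependent factors.
\begin{align*}
  \text{JDP:} \quad & \mathrm{Regret}(T) \;\lesssim\; \sqrt{T} \;+\; \frac{\mathrm{poly}(S,A,H)}{\varepsilon}
    && \text{(privacy cost: lower-order additive term)} \\
  \text{LDP:} \quad & \mathrm{Regret}(T) \;\lesssim\; \frac{\sqrt{T}\cdot\mathrm{poly}(S,A,H)}{\varepsilon}
    && \text{(privacy cost: multiplies the leading $\sqrt{T}$ term)}
\end{align*}
```

In words, a JDP mechanism leaves the leading $\sqrt{T}$ rate untouched and pays for privacy only through a term that grows more slowly in $T$, whereas an LDP mechanism inflates the leading term itself by a factor depending on $1/\varepsilon$.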