Motivated by the wide adoption of reinforcement learning (RL) in real-world personalized services, where users' sensitive and private information needs to be protected, we study regret minimization in finite-horizon Markov decision processes (MDPs) under the constraints of differential privacy (DP). Compared to existing private RL algorithms that work only on tabular finite-state, finite-action MDPs, we take the first step towards privacy-preserving learning in MDPs with large state and action spaces. Specifically, we consider MDPs with linear function approximation (in particular, linear mixture MDPs) under the notion of joint differential privacy (JDP), where the RL agent is responsible for protecting users' sensitive data. We design two private RL algorithms, based on value iteration and policy optimization respectively, and show that they enjoy sub-linear regret while guaranteeing privacy protection. Moreover, the regret bounds are independent of the number of states and scale at most logarithmically with the number of actions, making the algorithms suitable for privacy protection in today's large-scale personalized services. Our results are achieved via a general procedure for learning in linear mixture MDPs under changing regularizers, which not only generalizes previous results for non-private learning, but also serves as a building block for general private reinforcement learning.
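To give a sense of the kind of update such a procedure involves (an illustrative sketch under generic assumptions, not the paper's exact construction), the model parameter of a linear mixture MDP is typically estimated by ridge regression over observed transitions, and a joint-DP variant perturbs the regression statistics with calibrated noise while enlarging the regularizer to keep the perturbed Gram matrix well conditioned:

\[
\widehat{\theta}_k \;=\; \Bigl(\lambda_k I + \sum_{j<k}\phi_j\phi_j^\top + H_k\Bigr)^{-1}\Bigl(\sum_{j<k}\phi_j\, y_j + h_k\Bigr),
\]

where $\phi_j$ denotes the features of observed transitions, $y_j$ the regression targets, $H_k$ and $h_k$ the injected privacy noise, and the changing regularizer $\lambda_k$ is chosen large enough to dominate the noise with high probability. The symbols $\phi_j$, $y_j$, $H_k$, $h_k$, and $\lambda_k$ are placeholders for this sketch rather than the paper's notation.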