Temporal difference (TD) learning is a widely used method for policy evaluation in reinforcement learning. While many TD learning methods have been developed in recent years, little attention has been paid to preserving privacy, and most existing approaches may raise data privacy concerns from users. To enable policies with complex representation abilities, in this paper we consider preserving privacy in TD learning with nonlinear value function approximation. This is challenging because such a nonlinear problem is usually formulated as stochastic nonconvex-strongly-concave optimization to obtain a finite-sample analysis, which requires simultaneously preserving privacy on both the primal and dual sides. To this end, we employ momentum-based stochastic gradient descent ascent to obtain a single-timescale algorithm, and we achieve a good trade-off between meaningful privacy and utility guarantees on both the primal and dual sides by perturbing the gradients on both sides with well-calibrated Gaussian noise. As a result, our DPTD algorithm provides an $(\epsilon,\delta)$-differential privacy (DP) guarantee for the sensitive information encoded in transitions and retains the original power of TD learning, with the utility upper bounded by $\widetilde{\mathcal{O}}(\frac{(d\log(1/\delta))^{1/8}}{(n\epsilon)^{1/4}})$ (the tilde hides logarithmic factors), where $n$ is the trajectory length and $d$ is the dimension. Extensive experiments conducted in OpenAI Gym show the advantages of our proposed algorithm.
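To make the mechanism described above concrete, the following is a minimal, hypothetical Python sketch of one differentially private momentum SGDA step: the stochastic gradients on the primal and dual sides are each clipped and perturbed with Gaussian noise before the momentum-based descent/ascent update. The function name, parameter choices (`clip`, `sigma`, `beta`, `lr`), and the simple per-gradient clipping rule are illustrative assumptions, not the exact DPTD algorithm or its privacy calibration from the paper.

```python
import numpy as np

def dp_momentum_sgda_step(theta, w, v_theta, v_w, grad_theta, grad_w,
                          lr=0.01, beta=0.9, clip=1.0, sigma=1.0,
                          rng=np.random.default_rng()):
    """One illustrative DP-perturbed momentum SGDA step (assumed sketch).

    theta, w       : primal / dual parameter vectors
    v_theta, v_w   : momentum buffers for the two sides
    grad_theta/w   : stochastic gradients computed from a sampled transition
    clip, sigma    : clipping bound and Gaussian noise multiplier (assumed here;
                     the paper calibrates noise to its (eps, delta) budget)
    """
    def privatize(g):
        # Clip the gradient to bound its sensitivity, then add Gaussian noise.
        g = g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
        return g + rng.normal(0.0, sigma * clip, size=g.shape)

    g_theta = privatize(grad_theta)
    g_w = privatize(grad_w)

    # Momentum-based updates: descent on the primal side, ascent on the dual side.
    v_theta = beta * v_theta + (1.0 - beta) * g_theta
    v_w = beta * v_w + (1.0 - beta) * g_w
    theta = theta - lr * v_theta
    w = w + lr * v_w
    return theta, w, v_theta, v_w
```

Because both sides are updated with the same (noisy) momentum recursion in a single loop, the sketch is single-timescale in the sense used above; the privacy guarantee would come from composing the Gaussian mechanism over all iterations, which this snippet does not account for.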