Deep reinforcement learning agents may learn complex tasks more efficiently when they coordinate with one another. We consider a teacher-student coordination scheme in which one agent (the student) may ask another agent (the teacher) for demonstrations. Despite the benefits of sharing demonstrations, potential adversaries may obtain sensitive information belonging to the teacher by observing them. In particular, deep reinforcement learning algorithms are known to be vulnerable to membership inference attacks, which accurately infer whether particular entries belong to the training dataset. There is therefore a need to safeguard the teacher against such privacy threats. We fix the teacher's policy as the context of the demonstrations, which allows the student and the teacher to use different internal models, in contrast to existing methods. We make the following two contributions. (i) We develop a differentially private mechanism that protects the privacy of the teacher's training dataset. (ii) We propose a proximal policy-optimization objective that enables the student to benefit from the demonstrations despite the perturbations introduced by the privacy mechanism. We empirically show that the algorithm improves the student's learning in terms of both convergence rate and utility. Specifically, compared with an agent that learns the same task on its own, we observe that the student's policy converges faster and the converged policy accumulates higher rewards more reliably.
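To make the two contributions concrete, the following is a minimal, illustrative sketch rather than the paper's actual mechanism or objective: it assumes the teacher releases a demonstration as an action distribution perturbed by the standard Gaussian mechanism, and that the student adds a KL term toward the noisy demonstration to a PPO-style clipped surrogate. All function names, the sensitivity bound, and the hyperparameters (clip, beta, epsilon, delta) are assumptions introduced here for illustration.

    import numpy as np

    def gaussian_mechanism(probs, epsilon, delta, sensitivity=np.sqrt(2.0)):
        """Perturb a teacher action distribution with calibrated Gaussian noise
        (standard (epsilon, delta)-DP Gaussian mechanism; sensitivity is assumed)."""
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
        noisy = probs + np.random.normal(0.0, sigma, size=probs.shape)
        noisy = np.clip(noisy, 1e-8, None)   # keep probabilities positive
        return noisy / noisy.sum()           # re-normalize onto the simplex

    def ppo_demo_objective(ratio, advantage, student_probs, demo_probs,
                           clip=0.2, beta=0.1):
        """PPO clipped surrogate plus a KL penalty pulling the student's
        action distribution toward the (noisy) teacher demonstration."""
        clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
        surrogate = np.minimum(ratio * advantage, clipped * advantage)
        kl_to_demo = np.sum(demo_probs * np.log(demo_probs / student_probs))
        return surrogate - beta * kl_to_demo   # quantity to be maximized

    # Toy usage with made-up numbers.
    teacher_probs = np.array([0.7, 0.2, 0.1])
    demo = gaussian_mechanism(teacher_probs, epsilon=1.0, delta=1e-5)
    student_probs = np.array([0.5, 0.3, 0.2])
    value = ppo_demo_objective(ratio=1.05, advantage=0.8,
                               student_probs=student_probs, demo_probs=demo)
    print("noisy demonstration:", demo, "objective:", value)

Under this sketch, the privacy guarantee comes entirely from the noise added before the demonstration leaves the teacher, while the clipped surrogate limits how far a single noisy demonstration can move the student's policy in one update.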