This paper investigates deceptive reinforcement learning for privacy preservation in model-free and continuous action space domains. In reinforcement learning, the reward function defines the agent's objective. In adversarial scenarios, an agent may need to both maximise rewards and keep its reward function private from observers. Recent research presented the ambiguity model (AM), which uses pre-trained $Q$-functions to select actions that are ambiguous over a set of possible reward functions. Despite promising results in model-based domains, our investigation shows that AM is ineffective in model-free domains due to misdirected state space exploration; it is also inefficient to train and inapplicable in continuous action space domains. We propose the deceptive exploration ambiguity model (DEAM), which explores with the deceptive policy itself during training, leading to targeted exploration of the state space. DEAM is also applicable in continuous action spaces. We evaluate DEAM in discrete and continuous action space path planning environments. DEAM achieves similar performance to an optimal, model-based version of AM and outperforms a model-free version of AM in terms of path cost, deceptiveness, and training efficiency. These results extend to the continuous domain.
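To make the idea of action selection that is "ambiguous over a set of possible reward functions" concrete, the following minimal Python sketch picks, from pre-trained $Q$-functions (one per candidate reward function), the action over which an observer would find the candidates hardest to distinguish. The softmax-entropy ambiguity measure, the function name `ambiguous_action`, and the temperature `beta` are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def ambiguous_action(q_tables, state, actions, beta=1.0):
    """Select the action that is most ambiguous over candidate reward functions.

    q_tables: list of dicts mapping (state, action) -> Q-value, one per
              candidate reward function (pre-trained, as in AM).
    Ambiguity here is the entropy of a softmax over per-candidate Q-values,
    an assumed stand-in for the observer's inference about the true reward.
    """
    best_action, best_entropy = None, -np.inf
    for a in actions:
        # Boltzmann weight of each candidate reward function for this action.
        qs = np.array([q[(state, a)] for q in q_tables])
        weights = np.exp(beta * (qs - qs.max()))
        probs = weights / weights.sum()
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        # Higher entropy: the action looks equally plausible under more candidates.
        if entropy > best_entropy:
            best_action, best_entropy = a, entropy
    return best_action
```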