As reinforcement learning techniques are increasingly applied to real-world decision problems, attention has turned to how these algorithms use potentially sensitive information. We consider the task of training a policy that maximizes reward while minimizing the disclosure of certain sensitive state variables through its actions. We give examples of how this setting captures real-world privacy problems in sequential decision-making. We solve this problem in the policy gradients framework by introducing a regularizer based on the mutual information (MI) between the sensitive state and the actions at a given timestep. We develop a model-based stochastic gradient estimator for optimizing privacy-constrained policies. We also discuss an alternative MI regularizer that serves as an upper bound on our main MI regularizer and can be optimized in a model-free setting. We contrast previous work on differentially private RL with our mutual-information formulation of information disclosure. Experimental results show that our training method yields policies that hide the sensitive state.
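As a rough sketch of the objective described above (the notation here is illustrative and not taken from the paper body): writing $S^{\mathrm{sens}}_t$ for the sensitive state variables, $A_t$ for the action at timestep $t$, and $\beta > 0$ for an assumed regularization weight, the MI-regularized policy optimization problem can be written as

\[
\max_{\theta} \;\; \mathbb{E}_{\pi_\theta}\!\left[\sum_{t} \gamma^{t} r_t\right] \;-\; \beta \sum_{t} I\!\left(S^{\mathrm{sens}}_t ;\, A_t\right),
\]

where the first term is the standard discounted return under policy $\pi_\theta$ and the second term penalizes the per-timestep mutual information between the sensitive state and the action.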