We introduce the framework of performative reinforcement learning, in which the policy chosen by the learner affects the underlying reward and transition dynamics of the environment. Following the recent literature on performative prediction (Perdomo et al., 2020), we introduce the concept of a performatively stable policy. We then consider a regularized version of the reinforcement learning problem and show that repeatedly optimizing this objective converges to a performatively stable policy under reasonable assumptions on the transition dynamics. Our proof utilizes the dual perspective of the reinforcement learning problem and may be of independent interest for analyzing the convergence of other algorithms with decision-dependent environments. We then extend our results to the setting where the learner performs only gradient ascent steps instead of fully optimizing the objective, and to the setting where the learner has access to a finite number of trajectories from the changed environment. In both settings, we leverage the dual formulation of performative reinforcement learning and establish convergence to a stable solution. Finally, through extensive experiments on a grid-world environment, we demonstrate how convergence depends on various parameters, e.g., regularization, smoothness, and the number of samples.
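To make these notions concrete, the following is a minimal sketch; the symbols $V$, $r_\pi$, $P_\pi$, the occupancy measure $d$, the feasible set $\mathcal{D}(\cdot)$, and the regularization weight $\lambda$ are illustrative notation introduced here rather than taken from the abstract. A policy $\pi_S$ is performatively stable if it remains optimal in the environment that its own deployment induces,
\[
\pi_S \in \operatorname*{argmax}_{\pi} \; V\bigl(\pi;\, r_{\pi_S},\, P_{\pi_S}\bigr),
\]
and one natural instantiation of the regularized repeated optimization, written in terms of occupancy measures with an assumed quadratic regularizer, is
\[
d_{t+1} \in \operatorname*{argmax}_{d \in \mathcal{D}(P_{d_t})} \; \sum_{s,a} d(s,a)\, r_{d_t}(s,a) \;-\; \frac{\lambda}{2}\, \|d\|_2^2,
\]
where $\mathcal{D}(P_{d_t})$ denotes the set of occupancy measures consistent with the transition dynamics induced by the previously deployed solution.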