Post-deployment machine learning algorithms often influence the environments they act in, and thus shift the underlying dynamics that standard reinforcement learning (RL) methods ignore. While the design of optimal algorithms in this performative setting has recently been studied in supervised learning, the RL counterpart remains under-explored. In this paper, we prove performative counterparts of the performance difference lemma and the policy gradient theorem in RL, and further introduce the Performative Policy Gradient algorithm (PePG). PePG is the first policy gradient algorithm designed to account for performativity in RL. Under softmax parametrisation, both with and without entropy regularisation, we prove that PePG converges to performatively optimal policies, i.e. policies that remain optimal under the distribution shifts they themselves induce. PePG thus significantly extends prior work in performative RL, which achieves performative stability but not optimality. Furthermore, our empirical analysis on standard performative RL environments validates that PePG outperforms both standard policy gradient algorithms and existing performative RL algorithms that aim only for stability.