Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (with a uniform reference policy, this is known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and we extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling together with a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret independent of $\beta$, which is attained in both the regularized and unregularized settings.
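For concreteness, a minimal sketch of the KL-regularized objective in a two-player zero-sum matrix game, written in standard notation that is assumed here for illustration rather than taken from the paper: with payoff matrix $A$, policies $(\mu, \nu)$ for the max- and min-player, reference policies $(\mu_{\mathrm{ref}}, \nu_{\mathrm{ref}})$, and regularization strength $\beta$, the regularized value would typically read
$$
V_{\beta}(\mu, \nu) \;=\; \mu^{\top} A \nu \;-\; \beta\,\mathrm{KL}\!\left(\mu \,\Vert\, \mu_{\mathrm{ref}}\right) \;+\; \beta\,\mathrm{KL}\!\left(\nu \,\Vert\, \nu_{\mathrm{ref}}\right),
$$
so that each player is penalized (from its own optimization perspective) for deviating from its reference policy, and taking $\beta \to 0$ recovers the unregularized game.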