双层强化学习中的样本复杂度界限研究 (On The Sample Complexity Bounds In Bilevel Reinforcement Learning)

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problems. We overcome these challenges by leveraging the Polyak-{\L}ojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling tight sample complexity analysis. Our analysis also extends to general bi-level optimization settings with non-convex lower levels, where we achieve state-of-the-art sample complexity results of $\mathcal{O}(\epsilon^{-3})$ improving upon existing bounds of $\mathcal{O}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.

翻译：双层强化学习（BRL）已成为对齐生成模型的有力框架，但其理论基础，尤其是样本复杂度界限，仍未得到充分探索。本文首次提出了BRL的样本复杂度界限，在连续状态-动作空间中确立了$\mathcal{O}(\epsilon^{-3})$的收敛速率。由于其嵌套结构和非凸下层问题，传统的马尔可夫决策过程分析技术无法直接推广到BRL。我们通过利用Polyak-{\L}ojasiewicz（PL）条件和MDP结构获得闭式梯度，克服了这些挑战，从而实现了严格的样本复杂度分析。我们的分析还扩展到具有非凸下层的一般双层优化场景，在此我们取得了$\mathcal{O}(\epsilon^{-3})$的先进样本复杂度结果，优于现有的$\mathcal{O}(\epsilon^{-6})$界限。此外，针对超梯度估计的计算瓶颈，我们提出了一种完全一阶、无需海森矩阵的算法，适用于大规模问题。