Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\mathcal{O}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL because of its nested structure and non-convex lower-level problem. We overcome these challenges by exploiting the Polyak-{\L}ojasiewicz (PL) condition together with the MDP structure to obtain closed-form gradients, which enables a tight sample complexity analysis. Our analysis further extends to general bilevel optimization with non-convex lower-level problems, where we obtain a state-of-the-art sample complexity of $\mathcal{O}(\epsilon^{-3})$, improving upon the existing $\mathcal{O}(\epsilon^{-6})$ bound. Finally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.
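Concretely, the bilevel template underlying this kind of analysis can be sketched as follows; the symbols below (upper-level variable $x$, lower-level variable $y$, objectives $f$ and $g$, PL constant $\mu$) are illustrative notation for exposition, not fixed by the abstract:
\begin{equation*}
  \min_{x}\; F(x) := f\bigl(x,\, y^*(x)\bigr)
  \qquad \text{s.t.} \qquad
  y^*(x) \in \operatorname*{arg\,min}_{y}\; g(x, y),
\end{equation*}
where $g(x,\cdot)$ is non-convex but satisfies the PL condition with constant $\mu > 0$:
\begin{equation*}
  \tfrac{1}{2}\,\bigl\|\nabla_y g(x, y)\bigr\|^2 \;\ge\; \mu \Bigl( g(x, y) - \min_{y'}\, g(x, y') \Bigr)
  \quad \text{for all } y.
\end{equation*}
In the BRL instance, the lower level is itself a policy optimization problem, and PL-type (gradient domination) inequalities are known to hold for standard policy parameterizations, which is what makes the non-convex lower level tractable.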
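To make the Hessian-free claim concrete, recall the classical implicit-differentiation hypergradient, which requires second-order information:
\begin{equation*}
  \nabla F(x) \;=\; \nabla_x f\bigl(x, y^*(x)\bigr)
  \;-\; \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)\,
        \bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\bigr]^{-1}
        \nabla_y f\bigl(x, y^*(x)\bigr),
\end{equation*}
and is ill-posed when $\nabla^2_{yy} g$ is singular, as can happen under a merely PL (rather than strongly convex) lower level. A standard gradient-only surrogate from the fully first-order bilevel literature, shown here as an illustration of the mechanism rather than as this paper's exact estimator, penalizes the lower-level value function $v(x) := \min_{y} g(x, y)$:
\begin{equation*}
  \mathcal{L}_\lambda(x, y) \;=\; f(x, y) + \lambda\bigl( g(x, y) - v(x) \bigr),
  \qquad
  \nabla_x \mathcal{L}_\lambda(x, y) \;=\; \nabla_x f(x, y) + \lambda\Bigl( \nabla_x g(x, y) - \nabla_x g\bigl(x, y^*(x)\bigr) \Bigr),
\end{equation*}
where the second identity uses the envelope theorem, $\nabla v(x) = \nabla_x g\bigl(x, y^*(x)\bigr)$. Only gradients of $f$ and $g$ are ever evaluated, and as $\lambda \to \infty$ stationary points of $\mathcal{L}_\lambda$ approximate those of $F$; in practice such schemes maintain two lower-level iterates, one tracking $y^*(x)$ and one tracking the minimizer of $\mathcal{L}_\lambda(x,\cdot)$, updated alongside $x$.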