We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). Prior work in this setting has established only asymptotic convergence guarantees, leaving open the question of sample complexity. We address this gap by showing that the robust Bellman operator is a contraction under a carefully constructed semi-norm, and by developing a stochastic approximation framework with controlled bias. Our approach builds upon Multi-Level Monte Carlo (MLMC) techniques to estimate the robust Bellman operator efficiently. To overcome the infinite expected sample complexity inherent in standard MLMC, we introduce a truncation mechanism based on a geometric distribution, ensuring a finite expected sample complexity while maintaining a small bias that decays exponentially with the truncation level. Our method achieves the order-optimal sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ for robust policy evaluation and robust average-reward estimation, marking a significant advancement in robust reinforcement learning theory.
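As an illustration only, and not the algorithm analyzed in the paper, the sketch below shows the generic idea of an MLMC estimator with a geometrically distributed, truncated level: the level probability reweights a telescoping correction so the estimator's bias matches that of a very large sample, while truncation keeps the expected number of samples finite. The names `mlmc_estimate`, `sample_fn`, and `estimator_fn` are hypothetical placeholders; in the policy-evaluation setting, `estimator_fn` would stand in for an empirical robust Bellman backup at a given state.

```python
import numpy as np


def mlmc_estimate(sample_fn, estimator_fn, p=0.5, n_max=10, rng=None):
    """Multi-Level Monte Carlo estimate with a truncated geometric level.

    sample_fn(k)    -- draws k i.i.d. samples (e.g. next-state transitions).
    estimator_fn(s) -- a plug-in estimate from the samples s (e.g. an
                       empirical robust Bellman backup at one state); its
                       bias shrinks as the number of samples grows.
    The level N follows a geometric distribution truncated at n_max, so the
    expected number of samples per call is finite while the residual bias
    decays exponentially in n_max.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw a truncated geometric level N in {0, 1, ..., n_max}.
    n = min(int(rng.geometric(p)) - 1, n_max)
    # Probability of this level under truncation (all tail mass collapses to n_max).
    prob_n = (1.0 - p) ** n * p if n < n_max else (1.0 - p) ** n_max
    samples = sample_fn(2 ** (n + 1))
    half = 2 ** n
    fine = estimator_fn(samples)                        # estimate from 2^(n+1) samples
    coarse = 0.5 * (estimator_fn(samples[:half])        # average of the two
                    + estimator_fn(samples[half:]))     # half-sample estimates
    base = estimator_fn(samples[:1])                    # single-sample estimate
    # Telescoping correction, reweighted by the level probability.
    return base + (fine - coarse) / prob_n


if __name__ == "__main__":
    # Toy check: debiasing the nonlinear plug-in statistic max(0, mean).
    rng = np.random.default_rng(0)
    draw = lambda k: rng.normal(0.3, 1.0, size=k)
    plug_in = lambda s: max(0.0, float(np.mean(s)))     # biased for few samples
    estimates = [mlmc_estimate(draw, plug_in, rng=rng) for _ in range(20000)]
    print(np.mean(estimates))                           # close to max(0, 0.3) = 0.3
```

In this sketch the per-call expected sample count grows only linearly in the truncation level `n_max`, while the untruncated estimator (letting `n_max` go to infinity with `p = 0.5`) would have infinite expected sample complexity, which is the issue the truncation mechanism in the abstract is designed to avoid.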


