We study model-free reinforcement learning (RL) algorithms in episodic non-stationary constrained Markov Decision Processes (CMDPs), in which an agent aims to maximize the expected cumulative reward subject to a cumulative constraint on the expected utility (cost). In the non-stationary environment, the reward and utility functions and the transition kernels can vary arbitrarily over time, as long as the cumulative variations do not exceed certain variation budgets. We propose the first model-free, simulator-free RL algorithms for non-stationary CMDPs, in both the tabular and the linear function approximation settings, with provable performance guarantees: sublinear regret and zero constraint violation. Our regret and constraint-violation bounds for the tabular case match the corresponding best results for stationary CMDPs when the total variation budget is known. Additionally, we present a general framework that addresses the well-known challenges of analyzing non-stationary CMDPs without requiring prior knowledge of the variation budgets, and we apply it to both the tabular and the linear function approximation settings.
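To make the setting concrete, the following is a sketch of one standard formalization of the variation budgets and the per-episode constrained objective; the notation ($r_{k,h}$, $g_{k,h}$, $P_{k,h}$ for the episode-$k$ reward, utility, and transition kernel at step $h$, the threshold $b$, and the budgets $B_r$, $B_g$, $B_P$) is illustrative and may differ from the paper's own. Over $K$ episodes, the cumulative variations are
\[
B_r=\sum_{k=2}^{K}\max_{h,s,a}\bigl|r_{k,h}(s,a)-r_{k-1,h}(s,a)\bigr|,\qquad
B_g=\sum_{k=2}^{K}\max_{h,s,a}\bigl|g_{k,h}(s,a)-g_{k-1,h}(s,a)\bigr|,
\]
\[
B_P=\sum_{k=2}^{K}\max_{h,s,a}\bigl\|P_{k,h}(\cdot\mid s,a)-P_{k-1,h}(\cdot\mid s,a)\bigr\|_{1},
\]
and in each episode $k$ the agent faces the constrained problem
\[
\max_{\pi}\; V^{\pi}_{r,k}(s_1)\quad\text{s.t.}\quad V^{\pi}_{g,k}(s_1)\ge b,
\]
where $V^{\pi}_{r,k}$ and $V^{\pi}_{g,k}$ denote the expected cumulative reward and utility of policy $\pi$ under the episode-$k$ model. Non-stationarity is admissible as long as $B_r$, $B_g$, and $B_P$ remain within the given budgets.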