Many existing reinforcement learning (RL) methods employ stochastic gradient iterations on the back end, whose stability hinges on the assumption that the data-generating process mixes exponentially fast, with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is typically unknown, rendering the prescribed step-size selection inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing multi-level Monte Carlo estimators for the critic, the actor, and the average reward, embedded within an actor-critic (AC) algorithm. This method, which we call \textbf{M}ulti-level \textbf{A}ctor-\textbf{C}ritic (MAC), is developed specifically for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it is therefore readily applicable to applications with slower mixing times. Nevertheless, it achieves a convergence rate comparable to state-of-the-art AC algorithms. We show experimentally that these relaxed technical conditions for stability translate into superior practical performance on RL problems with sparse rewards.
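To fix ideas, the following is a minimal sketch of a standard multi-level Monte Carlo construction of such an estimator; it illustrates the general technique rather than the exact estimator used in MAC, and the symbols $h$, $g_j$, and $J_{\max}$ are illustrative notation. Draw a random level $J$ with $\Pr(J=j)=2^{-j}$ for $j\ge 1$, collect $2^{J}$ consecutive samples $x_1,\dots,x_{2^{J}}$ from the trajectory, and form
\[
\widehat{g} \;=\; g_{0} \;+\; 2^{J}\bigl(g_{J}-g_{J-1}\bigr),
\qquad
g_{j} \;=\; \frac{1}{2^{j}}\sum_{i=1}^{2^{j}} h(x_{i}),
\]
where $h(x_i)$ denotes the per-sample quantity being estimated (e.g., a temporal-difference term for the critic, a score-function term for the actor, or the instantaneous reward for the average-reward tracker). By the telescoping sum, when the level is truncated at some maximum $J_{\max}$ the estimator satisfies $\mathbb{E}[\widehat{g}]=\mathbb{E}[g_{J_{\max}}]$, i.e., it matches in expectation an average over $2^{J_{\max}}$ samples while drawing only $O(J_{\max})$ samples per update in expectation. This is the mechanism by which the estimator tracks long-run behavior under slow mixing without a mixing-time-dependent step-size choice.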