SARSA, a classical on-policy control algorithm for reinforcement learning, is known to chatter when combined with linear function approximation: SARSA does not diverge but oscillates in a bounded region. However, little is known about how fast SARSA converges to that region and how large the region is. In this paper, we make progress towards answering these open questions by establishing the convergence rate of projected SARSA to a bounded region. Importantly, the region is much smaller than the ball used for projection, provided that the magnitude of the reward is not too large. Our analysis applies to expected SARSA as well as SARSA($\lambda$). Existing works on the convergence of linear SARSA to a fixed point all require the Lipschitz constant of SARSA's policy improvement operator to be sufficiently small; our analysis instead applies to arbitrary Lipschitz constants and thus characterizes the behavior of linear SARSA in a new regime.
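For concreteness, a sketch of the projected linear SARSA update in one standard notation (the symbols $w_t$, $\phi$, $\alpha_t$, $\gamma$, $R_{t+1}$, and the projection radius are illustrative assumptions, not fixed by the abstract): at each step the weight vector is updated as
\[
w_{t+1} = \Gamma\!\left( w_t + \alpha_t \left( R_{t+1} + \gamma\, w_t^\top \phi(S_{t+1}, A_{t+1}) - w_t^\top \phi(S_t, A_t) \right) \phi(S_t, A_t) \right),
\]
where $\phi$ is the feature map, $\Gamma$ denotes projection onto a norm ball, and $A_{t+1}$ is drawn from the policy that the (Lipschitz) policy improvement operator assigns to $w_t$, e.g., a softmax over $w_t^\top \phi(S_{t+1}, \cdot)$.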