There are many algorithms for regret minimisation in episodic reinforcement learning. This problem is well understood from a theoretical perspective, provided that the sequences of states, actions and rewards associated with each episode are available to the algorithm updating the policy immediately after every interaction with the environment. However, feedback is almost always delayed in practice. In this paper, we study the impact of delayed feedback in episodic reinforcement learning from a theoretical perspective and propose two general-purpose approaches to handling the delays. The first updates the policy as soon as new information becomes available, whereas the second waits before using newly observed information to update the policy. For the class of optimistic algorithms and either approach, we show that the regret increases by an additive term involving the number of states and actions, the episode length, the expected delay and an algorithm-dependent constant. To validate our theoretical results, we empirically investigate the impact of various delay distributions on the regret of optimistic algorithms.
Title: Optimism and Delays in Episodic Reinforcement Learning
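To make the two delay-handling schemes concrete, below is a minimal Python sketch assuming a toy tabular environment; the environment, the delay distribution, the counting-based "update" and every name in it are illustrative stand-ins rather than the algorithms analysed in the paper. It contrasts updating as soon as each delayed trajectory arrives with buffering arrivals and updating only periodically.

import heapq
import random

# Purely illustrative sketch of the two delay-handling schemes described in the
# abstract: (1) update the policy as soon as delayed feedback arrives, and
# (2) buffer arrivals and update only periodically. The toy environment, the
# delay distribution, the counting-based "update" and all names below are
# stand-ins, not the paper's algorithms.

def run_episode(policy, horizon=5, n_states=3):
    """Simulate one episode and return its (state, action, reward) trajectory."""
    trajectory, state = [], 0
    for _ in range(horizon):
        action = policy(state)
        reward = random.random()                  # stand-in reward
        trajectory.append((state, action, reward))
        state = random.randrange(n_states)        # stand-in transition
    return trajectory

def update(stats, trajectory):
    """Placeholder policy update: accumulate visit counts and rewards per (s, a)."""
    for state, action, reward in trajectory:
        count, total = stats.get((state, action), (0, 0.0))
        stats[(state, action)] = (count + 1, total + reward)

def simulate(num_episodes=50, strategy="update_on_arrival", wait_every=5):
    """Run episodes whose feedback arrives after a random delay measured in episodes."""
    stats = {}
    policy = lambda state: random.randrange(2)    # stand-in policy (ignores stats)
    in_flight = []                                # min-heap of (arrival_episode, episode_idx, trajectory)
    buffered = []

    for k in range(num_episodes):
        # Deliver all feedback whose delay has elapsed by the start of episode k.
        while in_flight and in_flight[0][0] <= k:
            _, _, trajectory = heapq.heappop(in_flight)
            if strategy == "update_on_arrival":
                update(stats, trajectory)         # scheme 1: use new information immediately
            else:
                buffered.append(trajectory)       # scheme 2: hold it back for now

        if strategy == "wait" and k % wait_every == 0 and buffered:
            for trajectory in buffered:           # scheme 2: periodic batched update
                update(stats, trajectory)
            buffered.clear()

        # Play episode k; its feedback only becomes visible after a random delay.
        trajectory = run_episode(policy)
        delay = 1 + int(random.expovariate(0.5))  # stand-in delay distribution
        heapq.heappush(in_flight, (k + delay, k, trajectory))

    return stats

if __name__ == "__main__":
    print(len(simulate(strategy="update_on_arrival")), "state-action pairs updated (scheme 1)")
    print(len(simulate(strategy="wait")), "state-action pairs updated (scheme 2)")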