There are many provably efficient algorithms for episodic reinforcement learning. However, these algorithms are built under the assumption that the sequence of states, actions and rewards associated with each episode arrives immediately, allowing policy updates after every interaction with the environment. This assumption is often unrealistic in practice, particularly in areas such as healthcare and online recommendation. In this paper, we study the impact of delayed feedback on several provably efficient algorithms for regret minimisation in episodic reinforcement learning. Firstly, we consider updating the policy as soon as new feedback becomes available. Using this updating scheme, we show that the regret increases by an additive term involving the number of states and actions, the episode length and the expected delay. The form of this additive term depends on the optimistic algorithm of choice. We also show that updating the policy less frequently can lead to an improved dependence of the regret on the delays.
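To make the delayed-feedback protocol concrete, here is a minimal sketch of the first updating scheme, in which every trajectory that has arrived is consumed before each new episode is played. The toy environment, the placeholder agent interface (`act`, `update`) and the uniform delay model are illustrative assumptions only, not the paper's algorithms or analysis.

```python
# Sketch of episodic RL with stochastically delayed feedback (illustrative only).
import random


class ToyEpisodicEnv:
    """Two-state, two-action chain with a fixed horizon (an assumed toy example)."""
    horizon = 5

    def reset(self):
        self.state, self.t = 0, 0
        return self.state

    def step(self, action):
        self.t += 1
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = action           # the action deterministically picks the next state
        return self.state, reward, self.t >= self.horizon


class RandomAgent:
    """Placeholder for an optimistic algorithm; acts uniformly at random."""
    def act(self, state):
        return random.choice([0, 1])

    def update(self, trajectory):
        pass                          # an optimistic agent would refit its estimates here


def run_with_delays(env, agent, num_episodes, expected_delay):
    """The policy used in episode k depends only on feedback that has arrived by k."""
    pending = []                      # (arrival_episode, trajectory) pairs not yet observed
    for k in range(num_episodes):
        # Deliver every trajectory whose delay has elapsed, then update the policy.
        arrived = [traj for (t, traj) in pending if t <= k]
        pending = [(t, traj) for (t, traj) in pending if t > k]
        for traj in arrived:
            agent.update(traj)

        # Roll out one episode with the current (possibly stale) policy.
        trajectory, state = [], env.reset()
        for _ in range(env.horizon):
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward, next_state))
            state = next_state
            if done:
                break

        # Feedback arrives after a delay drawn uniformly from {1, ..., 2 * expected_delay},
        # so its mean is roughly expected_delay (an assumed delay model).
        delay = 1 + random.randrange(max(1, 2 * expected_delay))
        pending.append((k + delay, trajectory))


run_with_delays(ToyEpisodicEnv(), RandomAgent(), num_episodes=50, expected_delay=3)
```

The second scheme discussed above, updating less frequently, would correspond to calling `agent.update` only at a sparser schedule of episodes rather than whenever new trajectories arrive.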