The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. In practice, however, feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can change between episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
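To make the interaction protocol concrete, below is a minimal sketch (not taken from the paper) of the delayed bandit feedback setting: the cost feedback for episode $k$ becomes available to the learner only at the end of episode $k + d^k$, with the delay sequence fixed in advance by an oblivious adversary. Names such as `run_episode` and `update_policy` are hypothetical placeholders for the learner's internals.

```python
from collections import defaultdict

def delayed_feedback_loop(learner, env, delays):
    """Interact for K = len(delays) episodes with adversarially delayed bandit feedback."""
    K = len(delays)
    pending = defaultdict(list)  # episode index -> feedback revealed at the end of that episode
    for k in range(K):
        trajectory = learner.run_episode(env)     # play episode k with the current policy
        reveal_at = min(k + delays[k], K - 1)     # feedback for episode k arrives after episode k + d^k
                                                  # (clamped here so everything arrives by the last episode)
        pending[reveal_at].append((k, trajectory))
        for j, feedback in pending.pop(k, []):    # all feedback arriving at the end of episode k
            learner.update_policy(j, feedback)    # learner finally observes the costs of episode j
    total_delay = sum(delays)                     # D = sum_k d^k, the quantity driving the regret bound
    return total_delay
```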