The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback achieves a regret bound where the penalty for the delays is independent of the horizon. This result significantly improves upon existing work, where the best known regret bound has the delay penalty increasing with the horizon. We verify our theoretical results through experiments on simulated data.
翻译:随机广义线性赌博机是一个解决连续决策问题的常用模型,许多算法在立即反馈下实现了接近最优的后悔保证。然而,在许多实际应用中,即时奖励的严格要求并不符合,因为奖励几乎总是被延迟。我们以理论的方式研究广义线性赌博机中的延迟奖励现象。我们展示了一种自然的乐观算法的延迟反馈适应,实现了一个后悔界限,其中延迟的惩罚与地平线无关。这一结果显著改进了现有的工作,其中已知的最佳后悔界限对于延迟的惩罚随着地平线的增加而增加。我们通过模拟数据的实验验证了我们的理论结果。