The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, in many real-world settings, the requirement that the reward be observed immediately does not hold, and standard algorithms are no longer theoretically understood. We study delayed rewards in a theoretical manner by introducing a stochastic delay between selecting an action and receiving the reward. We then show that an algorithm based on the optimistic principle improves on existing approaches for this setting by eliminating the need for prior knowledge of the delay distribution and by relaxing assumptions on the decision set and the delays. This also improves the regret guarantee from $\widetilde O(\sqrt{dT}\sqrt{d + \mathbb{E}[\tau]})$ to $\widetilde O(d\sqrt{T} + d^{3/2}\mathbb{E}[\tau])$, where $\mathbb{E}[\tau]$ denotes the expected delay, $d$ is the dimension, $T$ is the time horizon, and logarithmic terms are suppressed. We verify our theoretical results through experiments on simulated data.
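To make the delayed-feedback interaction protocol concrete, the following is a minimal simulation sketch, not the paper's algorithm: an OFUL/LinUCB-style optimistic learner in the linear special case, where each reward is generated at the round it is earned but only revealed to the learner after a random delay. The geometric delay distribution, the random unit-vector decision sets, and the constants `lam`, `beta`, and `mean_delay` are illustrative assumptions, not quantities taken from the paper.

```python
# Illustrative sketch (assumed setup, not the paper's exact algorithm):
# optimistic linear bandit with stochastically delayed reward observations.
import numpy as np

rng = np.random.default_rng(0)

d, T, K = 5, 2000, 20              # dimension, horizon, arms per round (assumed)
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)
lam, beta = 1.0, 2.0               # ridge regularisation, confidence width (assumed)
mean_delay = 10.0                  # E[tau]; delays drawn geometrically (assumed)

V = lam * np.eye(d)                # Gram matrix built from *observed* actions only
b = np.zeros(d)                    # sum of reward-weighted observed actions
pending = []                       # (arrival_round, action, reward) not yet observed
regret = 0.0

for t in range(T):
    # Reveal every reward whose delay has elapsed, then update the estimate.
    still_pending = []
    for arrive, x, r in pending:
        if arrive <= t:
            V += np.outer(x, x)
            b += r * x
        else:
            still_pending.append((arrive, x, r))
    pending = still_pending

    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b

    # Optimistic action choice: empirical estimate plus exploration bonus.
    arms = rng.normal(size=(K, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, V_inv, arms))
    x = arms[np.argmax(arms @ theta_hat + beta * bonus)]

    # The reward is generated now but only becomes observable after a delay.
    r = x @ theta_star + 0.1 * rng.normal()
    tau = rng.geometric(1.0 / mean_delay)
    pending.append((t + tau, x, r))

    regret += np.max(arms @ theta_star) - x @ theta_star

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

The key design point the sketch illustrates is that the confidence ellipsoid (through `V` and `b`) is built only from rewards that have already arrived, so delayed observations temporarily widen the exploration bonus rather than biasing the estimate.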