We investigate properties of Thompson Sampling in the stochastic multi-armed bandit problem with delayed feedback. In a setting with i.i.d. delays, we establish, to our knowledge, the first regret bounds for Thompson Sampling with arbitrary delay distributions, including ones with unbounded expectation. Our bounds are qualitatively comparable to the best available bounds derived via ad-hoc algorithms, and depend on the delays only via selected quantiles of the delay distributions. Furthermore, in extensive simulation experiments, we find that Thompson Sampling outperforms a number of alternative proposals, including methods specifically designed for settings with delayed feedback.