We study the stochastic Multi-Armed Bandit (MAB) problem with random delays in the feedback received by the algorithm. We consider two settings: the reward-dependent delay setting, where realized delays may depend on the stochastic rewards, and the reward-independent delay setting. Our main contribution is algorithms that achieve near-optimal regret in each setting, with an additional additive dependence on the quantiles of the delay distribution. Our results make no assumptions on the delay distributions: in particular, we do not assume they come from any parametric family, and we allow for unbounded support and unbounded expectation; we further allow for infinite delays, where the algorithm might occasionally never observe any feedback for a given pull.
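To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of a bandit loop with delayed feedback: each pull's reward is revealed only after a random, reward-independent delay, a small fraction of pulls are never observed at all, and a standard UCB1 rule acts only on the feedback that has already arrived. All arm means, delay parameters, and the 2% loss probability are hypothetical choices for illustration.

```python
import heapq
import math
import random

def delayed_ucb(n_arms=3, horizon=2000, seed=0):
    """Run UCB1 on delayed feedback: each pull's reward becomes visible
    only after a random delay, and some pulls are never observed.
    Illustrative sketch of the setting, not the paper's algorithm."""
    rng = random.Random(seed)
    means = [0.3, 0.5, 0.7]        # hypothetical Bernoulli arm means
    pending = []                   # min-heap of (arrival_time, arm, reward)
    counts = [0] * n_arms          # observed pulls per arm
    sums = [0.0] * n_arms          # observed reward sums per arm
    total_reward = 0.0
    for t in range(1, horizon + 1):
        # absorb all feedback that has arrived by round t
        while pending and pending[0][0] <= t:
            _, a, r = heapq.heappop(pending)
            counts[a] += 1
            sums[a] += r
        # choose an arm by UCB1 over the observed samples only
        if any(c == 0 for c in counts):
            arm = counts.index(0)  # no observation yet: explore this arm
        else:
            n_obs = sum(counts)
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(n_obs) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        total_reward += reward
        # reward-independent delay; ~2% of pulls yield no feedback ever
        if rng.random() < 0.98:
            delay = rng.expovariate(1 / 5.0)  # mean delay of 5 rounds
            heapq.heappush(pending, (t + delay, arm, reward))
    return total_reward / horizon

avg = delayed_ucb()
```

The key point the sketch illustrates is that the learner's statistics lag behind its actions: early rounds run almost blind until the first delayed observations arrive, and the never-observed pulls model the infinite-delay case allowed by the results above.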