This paper investigates the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and full-bandit delayed feedback, where the delayed feedback is assumed to be composite and anonymous. In other words, the delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components. Three models of delayed feedback: bounded adversarial, stochastic independent, and stochastic conditionally independent are studied, and regret bounds are derived for each of the delay models. Ignoring the problem dependent parameters, we show that regret bound for all the delay models is $\tilde{O}(T^{2/3} + T^{1/3} \nu)$ for time horizon $T$, where $\nu$ is a delay parameter defined differently in the three cases, thus demonstrating an additive term in regret with delay in all the three delay models. The considered algorithm is demonstrated to outperform other full-bandit approaches with delayed composite anonymous feedback.
翻译:----
本文研究了具有随机子模拉摩尔(在期望意义下)奖励和完全赌博机延迟反馈的组合多臂赌博机问题,其中延迟反馈被假设为复合和匿名。换句话说,延迟反馈由过去行动的奖励组成,其中未知将这些子组件分成哪些部分。研究了三个延迟反馈模型:有界对手性,随机独立和随机有条件独立,并为每种延迟模型导出了遗憾界。忽略问题相关的参数,我们表明对于所有延迟模型的遗憾界为$\tilde{O}(T^{2/3}+T^{1/3}\nu)$,其中$T$为时间跨度,$\nu$在三种情况下有不同定义,因此证明了在所有三种延迟模型中,遗憾有一个随延迟的附加项。证明算法表现优于其他具有延迟复合匿名反馈的全赌博机方法。