In Batched Multi-Armed Bandits (BMAB), the policy is not allowed to be updated at each time step. Usually, the setting specifies a maximum number of allowed policy updates, and the algorithm schedules them so as to minimize the expected regret. In this paper, we describe a novel setting for BMAB with the following twist: the timing of the policy updates is not controlled by the BMAB algorithm; instead, the amount of data received during each batch, called the \textit{crowd}, is influenced by the past selection of arms. We first design a near-optimal policy that relies on approximate knowledge of the parameters and prove that its regret is in $\mathcal{O}\left(\sqrt{\frac{\ln x}{x}}+\epsilon\right)$, where $x$ is the size of the crowd and $\epsilon$ is the parameter error. Next, we implement a UCB-inspired algorithm that guarantees an additional regret in $\mathcal{O}\left(\max(K\ln T,\sqrt{T\ln T})\right)$, where $K$ is the number of arms and $T$ is the horizon.