Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

We study the optimal batch-regret tradeoff for batch linear contextual bandits. For any batch number $M$, number of actions $K$, time horizon $T$, and dimension $d$, we provide an algorithm and prove its regret guarantee, which, due to technical reasons, features a two-phase expression as the time horizon $T$ grows. We also prove a lower bound theorem that surprisingly shows the optimality of our two-phase regret upper bound (up to logarithmic factors) in the \emph{full range} of the problem parameters, therefore establishing the exact batch-regret tradeoff. Compared to the recent work \citep{ruan2020linear}, which showed that $M = O(\log \log T)$ batches suffice to achieve the asymptotically minimax-optimal regret attainable without batch constraints, our algorithm is simpler and easier to implement in practice. Furthermore, our algorithm achieves the optimal regret for all $T \geq d$, while \citep{ruan2020linear} requires $T$ to be greater than an unrealistically large polynomial of $d$. In the course of our analysis, we also prove a new matrix concentration inequality whose bound depends on dynamic upper bounds of the matrices involved, which, to the best of our knowledge, is the first of its kind in the literature and may be of independent interest.
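To give some intuition for why $M = O(\log \log T)$ batches can be enough, the sketch below builds the batch grid $t_i = T^{(1 - 2^{-i})/(1 - 2^{-M})}$ that is commonly used in the batched bandit literature. This is an illustrative assumption only: the abstract does not specify the grid used by the algorithm studied here, and the function names `geometric_exponent_grid` and `batches_for_near_sqrt_rate` are hypothetical. On this grid the quantity $t_i / \sqrt{t_{i-1}}$ is the same for every batch and equals $T^{1/(2(1 - 2^{-M}))}$, which is within a constant factor of $\sqrt{T}$ once $M$ is about $\log_2 \log_2 T$.

```python
import math


def geometric_exponent_grid(T, M):
    """Return batch end points t_1 <= ... <= t_M = T.

    Assumption: the textbook grid t_i = T^{(1 - 2^{-i}) / (1 - 2^{-M})},
    not necessarily the grid used by the paper's algorithm.
    """
    if T < 1 or M < 1:
        raise ValueError("T and M must be positive integers")
    denom = 1.0 - 2.0 ** (-M)
    grid = []
    for i in range(1, M + 1):
        exponent = (1.0 - 2.0 ** (-i)) / denom
        # Round up to an integer round index and never exceed the horizon.
        grid.append(min(T, math.ceil(T ** exponent)))
    grid[-1] = T  # the last batch always ends at the horizon
    return grid


def batches_for_near_sqrt_rate(T):
    """Illustrative choice M = ceil(log2(log2(T))), with a floor of 1."""
    if T < 2:
        raise ValueError("T must be at least 2")
    return max(1, math.ceil(math.log2(math.log2(T))))


if __name__ == "__main__":
    T = 10**6
    M = batches_for_near_sqrt_rate(T)
    print(M, geometric_exponent_grid(T, M))
```

The grid is fixed before any data is observed, so it only shows how few policy updates are needed in principle. The two-phase regret bound and the dependence on $d$ and $K$ described in the abstract are not captured by this sketch.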
