In this paper, we study the batched Lipschitz bandit problem, where the expected reward is Lipschitz and the reward observations are collected in batches. We introduce a novel landscape-aware algorithm, called Batched Lipschitz Narrowing (BLiN), that naturally fits the batched feedback setting. In particular, we show that for a $T$-step problem with Lipschitz reward of zooming dimension $d_z$, our algorithm achieves the theoretically optimal regret rate of $\widetilde{\mathcal{O}}\left(T^{\frac{d_z + 1}{d_z + 2}}\right)$ using only $\mathcal{O}\left(\log\log T\right)$ batches. For the lower bound, we show that in an environment with $B$ batches, for any policy $\pi$, there exists a problem instance such that the expected regret is lower bounded by $\widetilde{\Omega}\left(R_z(T)^{\frac{1}{1-\left(\frac{1}{d+2}\right)^B}}\right)$, where $R_z(T)$ is the regret lower bound for vanilla Lipschitz bandits, which depends on the zooming dimension $d_z$, and $d$ is the dimension of the arm space. As a direct consequence, $B = \Omega(\log\log T)$ batches are needed to achieve the regret lower bound, and the BLiN algorithm is optimal.
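To make the batch requirement concrete, the following minimal sketch evaluates the exponent of $T$ in the $B$-batch lower bound and searches for the smallest $B$ at which that bound is within a constant factor of the unbatched rate. It assumes $R_z(T) = T^{\frac{d_z+1}{d_z+2}}$ with constants and logarithmic factors ignored. The function names and the `slack` threshold are hypothetical choices made for illustration; they are not part of BLiN.

```python
import math


def batched_exponent(d_z, d, B):
    """Exponent of T in R_z(T)^(1 / (1 - (1/(d+2))^B)).

    Assumption: R_z(T) = T^((d_z+1)/(d_z+2)), ignoring constants and logs.
    """
    base = (d_z + 1) / (d_z + 2)
    eps = (1.0 / (d + 2)) ** B
    return base / (1.0 - eps)


def min_batches(T, d_z, d, slack=math.e):
    """Smallest B for which the B-batch lower bound exceeds T^base
    by at most a multiplicative factor `slack` (an illustrative threshold)."""
    base = (d_z + 1) / (d_z + 2)
    B = 1
    # Excess factor is T^(exponent - base); compare it to slack in log scale.
    while (batched_exponent(d_z, d, B) - base) * math.log(T) > math.log(slack):
        B += 1
    return B


if __name__ == "__main__":
    for T in (10**6, 10**12, 10**24):
        print(T, min_batches(T, d_z=1, d=1))
```

Under these assumptions the excess exponent is $\frac{d_z+1}{d_z+2} \cdot \frac{\varepsilon}{1-\varepsilon}$ with $\varepsilon = (d+2)^{-B}$, so keeping $T$ raised to that excess bounded requires $(d+2)^{B} \gtrsim \log T$, that is, $B \gtrsim \log\log T / \log(d+2)$. This matches the $\Omega(\log\log T)$ scaling stated above.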