Best-Arm Identification in Correlated Multi-Armed Bandits

In this paper, we consider the problem of best-arm identification in multi-armed bandits in the fixed-confidence setting, where the goal is to identify, with probability at least $1-\delta$ for some $\delta>0$, the arm with the highest mean reward from the set of arms $\mathcal{K}$ using as few samples as possible. Most existing best-arm identification algorithms and analyses assume that the rewards of different arms are independent of each other. We propose a novel correlated bandit framework that captures domain knowledge about the correlation between arms in the form of upper bounds on the expected conditional reward of an arm, given a reward realization from another arm. Our proposed algorithm, C-LUCB, which generalizes the LUCB algorithm, uses this partial knowledge of correlations to sharply reduce the sample complexity of best-arm identification. More interestingly, we show that the total number of samples drawn by C-LUCB is of the form $\mathcal{O}\left(\sum_{k \in \mathcal{C}} \log\left(\frac{1}{\delta}\right)\right)$, as opposed to the typical $\mathcal{O}\left(\sum_{k \in \mathcal{K}} \log\left(\frac{1}{\delta}\right)\right)$ samples required in the independent reward setting. The improvement arises because the $\mathcal{O}(\log(1/\delta))$ term is summed only over the set of competitive arms $\mathcal{C}$, which is a subset of the original set of arms $\mathcal{K}$. Depending on the problem setting, the size of $\mathcal{C}$ can be as small as $2$, and hence using C-LUCB in the correlated bandit setting can lead to significant performance improvements. Our theoretical findings are supported by experiments on the MovieLens and Goodreads recommendation datasets.
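
To make the role of the conditional upper bounds concrete, the following is a minimal sketch and not the C-LUCB algorithm itself. It assumes discrete rewards and a hypothetical lookup table `upper_bound[r]` that bounds the expected reward of arm $l$ given that arm $k$ returned reward $r$. Averaging these bounds over observed samples of arm $k$ gives a quantity whose expectation is at least the mean of arm $l$ (by the tower rule). The names `pseudo_reward_bound` and `competitive_arms`, the toy numbers, and the simple comparison used to flag an arm as competitive are illustrative assumptions. The sketch also omits the confidence intervals that a fixed-confidence procedure would need.

```python
from statistics import mean


def pseudo_reward_bound(upper_bound, samples_k):
    """Average the assumed conditional upper bounds over observed rewards of arm k.

    upper_bound maps a reward r of arm k to a bound on E[R_l | R_k = r].
    The expectation of this average is at least the mean reward of arm l.
    """
    return mean(upper_bound[r] for r in samples_k)


def competitive_arms(bounds, samples, ref):
    """Keep the reference arm plus every arm whose averaged bound is not
    below the reference arm's empirical mean (an illustrative criterion)."""
    ref_mean = mean(samples[ref])
    keep = [ref]
    for arm, table in bounds.items():
        if arm != ref and pseudo_reward_bound(table, samples[ref]) >= ref_mean:
            keep.append(arm)
    return keep


# Hypothetical toy problem: binary rewards, arm 0 is the reference arm.
samples = {0: [1, 1, 0, 1]}      # empirical mean of arm 0 is 0.75
bounds = {
    1: {0: 0.9, 1: 0.8},         # averaged bound 0.825, so arm 1 stays
    2: {0: 0.6, 1: 0.2},         # averaged bound 0.3, so arm 2 is dropped
}
result = competitive_arms(bounds, samples, ref=0)
print(result)
```

In this toy instance only arms 0 and 1 remain, which mirrors the abstract's point that the $\log(1/\delta)$ cost is paid only for the competitive arms rather than for every arm in $\mathcal{K}$.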
