We consider the problem of finding, through adaptive sampling, which of n arms has the largest mean. Our objective is to determine a rule that identifies the best arm with a fixed minimum confidence using as few observations as possible, i.e., fixed-confidence (FC) best arm identification (BAI) in multi-armed bandits. We study such problems under the Bayesian setting with both Bernoulli and Gaussian arms. We propose to use the classical vector-at-a-time (VT) rule, which samples each remaining arm once in each round. We show how VT can be implemented and analyzed in our Bayesian setting and how it can be improved by early elimination; a schematic sketch is given below. Our analysis shows that these algorithms yield an optimal strategy under the prior. We also propose and analyze a variant of the classical play-the-winner (PW) algorithm. Numerical results show that these rules compare favorably with state-of-the-art algorithms.
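To make the sampling scheme concrete, the following is a minimal Python sketch of round-robin (vector-at-a-time) sampling of Bernoulli arms with an early-elimination step. The Beta(1,1) prior, the elimination threshold `delta`, the Monte Carlo posterior test, and the function name `vt_with_elimination` are illustrative assumptions, not the exact rule or analysis of the paper.

```python
import numpy as np

def vt_with_elimination(true_means, delta=0.01, prior=(1.0, 1.0),
                        n_mc=2000, max_rounds=10_000, rng=None):
    """Sketch: vector-at-a-time sampling with early elimination.

    Each round, every surviving Bernoulli arm is pulled once.  An arm is
    eliminated when its estimated posterior probability of being the best
    surviving arm drops below `delta`.  All tuning choices here are
    illustrative, not the paper's exact procedure.
    """
    rng = np.random.default_rng(rng)
    n = len(true_means)
    alpha = np.full(n, prior[0])          # Beta posterior parameters per arm
    beta = np.full(n, prior[1])
    active = list(range(n))
    total_pulls = 0

    for _ in range(max_rounds):
        # VT rule: one pull of every surviving arm per round.
        for i in active:
            reward = rng.random() < true_means[i]
            alpha[i] += reward
            beta[i] += 1 - reward
            total_pulls += 1

        # Early elimination: estimate each surviving arm's posterior
        # probability of being best via samples from the Beta posteriors.
        draws = rng.beta(alpha[active], beta[active], size=(n_mc, len(active)))
        p_best = np.mean(draws == draws.max(axis=1, keepdims=True), axis=0)
        active = [arm for arm, p in zip(active, p_best) if p >= delta]

        if len(active) == 1:
            return active[0], total_pulls

    # Fallback: return the surviving arm with the highest posterior mean.
    return max(active, key=lambda i: alpha[i] / (alpha[i] + beta[i])), total_pulls


if __name__ == "__main__":
    best, pulls = vt_with_elimination([0.3, 0.5, 0.45, 0.6], delta=0.01, rng=0)
    print(f"declared best arm: {best}, total samples: {pulls}")
```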