We consider the problem, introduced by \cite{Mason2020}, of identifying all the $\varepsilon$-optimal arms in a finite stochastic multi-armed bandit with Gaussian rewards. In the fixed-confidence setting, we give a lower bound on the number of samples required by any algorithm that returns the set of $\varepsilon$-good arms with failure probability at most some risk level $\delta$. This bound takes the form $T_{\varepsilon}^*(\mu)\log(1/\delta)$, where $T_{\varepsilon}^*(\mu)$ is a characteristic time that depends on the vector of mean rewards $\mu$ and the accuracy parameter $\varepsilon$. We also provide an efficient numerical method to solve the convex max-min program that defines the characteristic time. Our method is based on a complete characterization of the alternative bandit instances that the optimal sampling strategy needs to rule out, which makes our bound tighter than the one provided by \cite{Mason2020}. Using this method, we propose a Track-and-Stop algorithm that identifies the set of $\varepsilon$-good arms with high probability and is asymptotically optimal (as $\delta$ goes to zero) in terms of the expected sample complexity. Finally, through numerical simulations, we demonstrate our algorithm's advantage over state-of-the-art methods, even for moderate values of the risk parameter.