We study the problem of identifying the best arm in a stochastic multi-armed bandit game. We are given a set of $n$ arms indexed from $1$ to $n$, where each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $\theta_i$ and variance $\sigma_i^2$. Assume $\theta_1 > \theta_2 \geq \cdots \geq \theta_n$, so that arm $1$ is the unique best arm. We propose an adaptive algorithm that explores the reward gaps and variances of the arms and adapts its future sampling decisions to the gathered information, using a novel approach called \textit{grouped median elimination}. The proposed algorithm outputs the best arm with probability at least $1-\delta$ and uses at most $O \left(\sum_{i = 1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)(\ln \delta^{-1} + \ln \ln \Delta_i^{-1})\right)$ samples, where $\Delta_i$ (for $i \geq 2$) denotes the reward gap between arm $i$ and the best arm, and we define $\Delta_1 = \Delta_2$. This yields a significant advantage over variance-independent algorithms in favorable scenarios: for example, when $\sigma_i^2$ is much smaller than $\Delta_i$, the per-arm sample cost improves from order $\Delta_i^{-2}$ to order $\Delta_i^{-1}$. Moreover, this is the first result that removes the extra $\ln n$ factor on the best arm compared with the state of the art. We further show that $\Omega \left( \sum_{i = 1}^n \left( \frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \right) \ln \delta^{-1} \right)$ samples are necessary for any algorithm to achieve the same goal, which shows that our algorithm is optimal up to doubly logarithmic terms.
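The grouped median elimination procedure itself is not spelled out in this abstract. As background, the sketch below implements the classic \textit{median elimination} subroutine of Even-Dar, Mannor, and Mansour, the building block that median-elimination-style algorithms refine; the sampling interface \texttt{pull} and the simulated Bernoulli means in the usage example are illustrative assumptions, not taken from the paper.

\begin{verbatim}
import math
import random

def median_elimination(pull, arms, eps, delta):
    """Classic (eps, delta)-PAC median elimination.

    pull(i) is an assumed interface returning one reward sample in
    [0, 1] from arm i.  With probability at least 1 - delta, the
    returned arm has mean within eps of the best arm's mean.
    """
    surviving = list(arms)
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Sample each surviving arm often enough that its empirical
        # mean is (eps_l / 2)-accurate except with probability
        # delta_l (a Hoeffding-bound sample size).
        t = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {i: sum(pull(i) for _ in range(t)) / t
                 for i in surviving}
        # Eliminate every arm whose empirical mean falls below the
        # median, i.e. keep the better half.
        surviving.sort(key=means.get, reverse=True)
        surviving = surviving[: math.ceil(len(surviving) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return surviving[0]

# Illustrative usage with simulated Bernoulli arms (means assumed).
mus = [0.9, 0.7, 0.6, 0.5]
best = median_elimination(lambda i: float(random.random() < mus[i]),
                          range(len(mus)), eps=0.1, delta=0.05)
print("selected arm:", best)
\end{verbatim}

Because each round halves the number of surviving arms while the per-arm sample size grows only geometrically, the total number of samples stays linear in the number of arms; the grouped variant proposed in this work additionally adapts to the per-arm variances $\sigma_i^2$.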