We consider the best-k-arm identification problem for multi-armed bandits, where the objective is to select the exact set of k arms with the highest mean rewards by sequentially allocating measurement effort. We characterize the necessary and sufficient conditions for the optimal allocation using dual variables. Remarkably, these optimality conditions lead to an extension of the top-two algorithm design principle (Russo, 2020), initially proposed for best-arm identification. Furthermore, our optimality conditions induce a simple and effective selection rule, dubbed information-directed selection (IDS), that selects one of the top-two candidates based on a measure of information gain. As a theoretical guarantee, we prove that, when integrated with IDS, top-two Thompson sampling is (asymptotically) optimal for Gaussian best-arm identification, solving a glaring open problem in the pure exploration literature (Russo, 2020). As a by-product, we show that for k > 1, top-two algorithms cannot achieve optimality even with an oracle tuning parameter. Numerical experiments show the superior performance of the proposed top-two algorithms with IDS and a considerable improvement over algorithms without adaptive selection.
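For readers unfamiliar with the top-two design principle, the following is a minimal illustrative sketch of top-two Thompson sampling for Gaussian best-arm identification (k = 1). It is not the method proposed here: it uses a fixed tuning parameter `beta` to choose between the two candidates, in the spirit of Russo (2020), whereas the IDS rule replaces that fixed coin flip with an adaptive choice whose exact form is not stated in this abstract. Unit-variance rewards, a flat prior, one initial pull per arm, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np


def top_two_thompson_sampling(true_means, horizon, beta=0.5, seed=0, max_resample=1000):
    """Illustrative top-two Thompson sampling with a fixed beta (not IDS).

    Assumes unit-variance Gaussian rewards and a flat prior, so that after
    n pulls the posterior of an arm is N(sample mean, 1 / n).
    """
    rng = np.random.default_rng(seed)
    true_means = np.asarray(true_means, dtype=float)
    n_arms = len(true_means)

    # Pull every arm once so that each posterior is proper.
    counts = np.ones(n_arms)
    sums = rng.normal(true_means, 1.0)

    for _ in range(horizon - n_arms):
        post_mean = sums / counts
        post_std = 1.0 / np.sqrt(counts)

        # Leader: best arm under one posterior sample.
        leader = int(np.argmax(rng.normal(post_mean, post_std)))

        # Challenger: resample until a different arm comes out on top.
        challenger = leader
        for _ in range(max_resample):
            challenger = int(np.argmax(rng.normal(post_mean, post_std)))
            if challenger != leader:
                break
        if challenger == leader:
            # Fallback (an assumption): best posterior mean among the other arms.
            masked = post_mean.copy()
            masked[leader] = -np.inf
            challenger = int(np.argmax(masked))

        # Fixed-beta selection between the two candidates.
        arm = leader if rng.random() < beta else challenger
        sums[arm] += rng.normal(true_means[arm], 1.0)
        counts[arm] += 1

    # Recommend the arm with the highest posterior mean.
    return int(np.argmax(sums / counts)), counts


if __name__ == "__main__":
    best, counts = top_two_thompson_sampling([0.0, 0.5, 3.0], horizon=600)
    print("recommended arm:", best)
    print("pull counts:", counts)
```

The only step an adaptive selection rule such as IDS would change is the line that picks between `leader` and `challenger`; everything else in the loop is the generic top-two template.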