Efficient Pure Exploration for Combinatorial Bandits with Semi-Bandit Feedback

From arXiv. 45 pages, 3 tables. Appendices: A to I. Figures: 1(a), 1(b), 2(a), 2(b), 3(a), 3(b), 3(c), 4(a), 4(b), 5(a), 5(b), 5(c), 5(d), 6(a), 6(b). To be published in the 32nd International Conference on Algorithmic Learning Theory and the Proceedings of Machine Learning Research, vol. 132:1-45, 2021.

Combinatorial bandits with semi-bandit feedback generalize multi-armed bandits: the agent chooses sets of arms and observes a noisy reward for each arm contained in the chosen set. The action set satisfies a given structure, such as forming a base of a matroid or a path in a graph. We focus on the pure-exploration problem of identifying the best arm with fixed confidence, as well as a more general setting where the structure of the answer set differs from that of the action set. Using the recently popularized game framework, we interpret this problem as a sequential zero-sum game and develop a CombGame meta-algorithm whose instances are asymptotically optimal algorithms with finite-time guarantees. In addition to comparing two families of learners to instantiate our meta-algorithm, the main contribution of our work is a specific oracle-efficient instance for best-arm identification with combinatorial actions. Based on a projection-free online learning algorithm for convex polytopes, it is the first computationally efficient algorithm that is asymptotically optimal and has competitive empirical performance.
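
To make the feedback model concrete, here is a minimal sketch of semi-bandit feedback when actions are bases of a uniform matroid (any m of K arms). It is not the CombGame meta-algorithm from the paper: the sampling rule is naive uniform exploration, and the names `top_m_oracle`, `pull`, and `run_uniform_sampling`, the Gaussian noise model, and the fixed round budget are all assumptions made for illustration.

```python
import random


def top_m_oracle(weights, m):
    """Hypothetical linear maximization oracle for the uniform matroid:
    return the m arms with the largest weights."""
    ranked = sorted(range(len(weights)), key=lambda a: weights[a], reverse=True)
    return frozenset(ranked[:m])


def pull(action, means, rng, sigma=1.0):
    """Semi-bandit feedback: one noisy reward per arm in the chosen set.
    Gaussian noise is an assumption of this sketch."""
    return {arm: rng.gauss(means[arm], sigma) for arm in action}


def run_uniform_sampling(means, m, rounds, seed=0):
    """Naive baseline (not the paper's method): play random size-m sets
    for a fixed budget, then answer with the empirically best set."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    arms = list(range(k))
    for _ in range(rounds):
        action = frozenset(rng.sample(arms, m))
        # Each arm in the played set reveals its own reward.
        for arm, reward in pull(action, means, rng).items():
            counts[arm] += 1
            sums[arm] += reward
    estimates = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
    return top_m_oracle(estimates, m), counts
```

Calling `run_uniform_sampling([0.0, 5.0, 4.0, 0.1], m=2, rounds=2000)` returns the guessed best set together with per-arm sample counts. A fixed-confidence algorithm would instead choose actions adaptively and stop as soon as the answer is certified at the desired confidence level.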
