以图表信息在多武装强盗中实现最大化和满意度 (Maximizing and Satisficing in Multi-armed Bandits with Graph Information)

Pure exploration in multi-armed bandits has emerged as an important framework for modeling decision-making and search under uncertainty. In modern applications, however, one is often faced with a tremendously large number of options. Even obtaining one observation per option may be too costly rendering traditional pure exploration algorithms ineffective. Fortunately, one often has access to similar relationships amongst the options that can be leveraged. In this paper, we consider the pure exploration problem in stochastic multi-armed bandits where the similarities between the arms are captured by a graph and the rewards may be represented as a smooth signal on this graph. In particular, we consider the problem of finding the arm with the maximum reward (i.e., the maximizing problem) or one with a sufficiently high reward (i.e., the satisficing problem) under this model. We propose novel algorithms \textbf{\algoname{}} (GRaph-based UcB) and $\zeta$-\textbf{\algoname{}} for these problems and provide a theoretical characterization of their performance which specifically elicits the benefit of the graph side information. We also prove a lower bound on the data requirement, showing a large class of problems where these algorithms are near-optimal. We complement our theory with experimental results that show the benefit of capitalizing on such side information.

翻译：多武装匪徒的纯粹探索已成为在不确定情况下模拟决策和搜索的重要框架。然而,在现代应用中,人们往往面临大量选择。即使每个选择获得一次观察,也可能代价太高,使传统的纯勘探算法无效。幸运的是,人们往往能够接触可加以利用的选项之间的类似关系。在本文中,我们认为在随机多武装匪徒中,武器之间的相似之处通过图表捕捉到,而奖赏可能作为该图上的一个顺利信号。特别是,我们考虑找到手臂的问题,以最大奖赏(即最大化问题)或一个奖赏足够高(即讽刺问题)的问题。我们提出新的算法 \ textbfalgoname\ (基于GRAph UcB) 和 $zeta$\ textbfalgoname ⁇ 的纯粹的勘探问题。我们用图表侧面信息来具体描述其性能,这特别能给图表侧信息带来好处(即最大化问题),或者在这个模型下,我们用实验性模型展示了我们这一类的大规模数据的好处。