We study structured multi-armed bandits, the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker must discover the best course of action while observing only uncertain rewards over time. The decision-maker knows certain structural information about the reward distributions and would like to minimize regret by exploiting this information, where regret is the performance gap relative to a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thompson sampling algorithms are well known to suffer only minimal regret. As recently pointed out, however, neither algorithm is capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm, which we call DUSA, whose worst-case regret matches the information-theoretic regret lower bound up to a constant factor and which can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. Our proposed algorithm is the first computationally viable learning policy for structured bandit problems that attains asymptotically minimal regret.
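To make the high-level recipe concrete, the following is a minimal, hypothetical sketch of the kind of learning loop described above: at each round the learner re-solves an optimization problem at the current empirical reward estimates and then plays the arm that the solution suggests. The function solve_dual_allocation is a placeholder standing in for the dual of the regret lower bound; its form, and all other names here, are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np


def solve_dual_allocation(empirical_means: np.ndarray) -> np.ndarray:
    """Placeholder for the dual optimization step.

    A real implementation would exploit the known structure of the reward
    distributions; as a stand-in, this simply up-weights arms whose
    empirical means are close to the current best.
    """
    gaps = empirical_means.max() - empirical_means
    weights = 1.0 / (gaps + 1e-3) ** 2
    return weights / weights.sum()


def structured_bandit_loop(true_means: np.ndarray, horizon: int, seed: int = 0) -> float:
    """Run a structured-bandit-style loop and return the realized regret."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    sums = np.zeros(k)

    # Pull each arm once so that empirical means are well defined.
    for arm in range(k):
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward

    for _ in range(horizon - k):
        empirical_means = sums / counts
        allocation = solve_dual_allocation(empirical_means)
        arm = rng.choice(k, p=allocation)          # follow the suggested play
        reward = rng.normal(true_means[arm], 1.0)  # observe a noisy reward
        counts[arm] += 1
        sums[arm] += reward

    # Regret: shortfall relative to always playing the best arm in hindsight.
    return horizon * true_means.max() - float(np.dot(counts, true_means))


if __name__ == "__main__":
    print(structured_bandit_loop(np.array([0.2, 0.5, 0.9]), horizon=2000))
```

The sketch only illustrates the interaction pattern (estimate, re-optimize, play); the paper's contribution lies in the specific dual optimization that replaces the placeholder and yields the matching regret guarantee.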