In many online decision processes, the optimizing agent is called to choose between large numbers of alternatives with many inherent similarities; in turn, these similarities imply closely correlated losses that may confound standard discrete choice models and bandit algorithms. We study this question in the context of nested bandits, a class of adversarial multi-armed bandit problems where the learner seeks to minimize their regret in the presence of a large number of distinct alternatives with a hierarchy of embedded (non-combinatorial) similarities. In this setting, optimal algorithms based on the exponential weights blueprint (like Hedge, EXP3, and their variants) may incur significant regret because they tend to spend excessive amounts of time exploring irrelevant alternatives with similar, suboptimal costs. To account for this, we propose a nested exponential weights (NEW) algorithm that performs a layered exploration of the learner's set of alternatives based on a nested, step-by-step selection method. In so doing, we obtain a series of tight bounds for the learner's regret showing that online learning problems with a high degree of similarity between alternatives can be resolved efficiently, without a red bus / blue bus paradox occurring.
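To make the "layered exploration" idea concrete, the sketch below illustrates a two-level hierarchy in which each node runs an exponential-weights (Hedge/EXP3-style) update and an arm is drawn by first sampling a class of similar alternatives, then an alternative within that class. This is only an illustrative sketch of the general principle, not the authors' NEW algorithm; the class structure, learning rate, and loss model are hypothetical.

```python
# Illustrative sketch of layered exponential-weights selection over a two-level
# hierarchy of alternatives. Not the paper's exact NEW algorithm: names,
# learning rates, and the loss model are assumptions for the example.

import math
import random


class ExpWeights:
    """Basic exponential-weights sampler over a fixed set of indices."""

    def __init__(self, n, eta):
        self.weights = [1.0] * n
        self.eta = eta  # learning rate

    def sample(self):
        # Sample an index proportionally to its weight; return (index, probability).
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i, p
        return len(probs) - 1, probs[-1]

    def update(self, i, loss_estimate):
        # Multiplicative update on the sampled index only (EXP3-style).
        self.weights[i] *= math.exp(-self.eta * loss_estimate)


def nested_play(classes, top, per_class, losses):
    """One round of layered selection: pick a class, then an arm inside it."""
    c, p_c = top.sample()
    a, p_a = per_class[c].sample()
    loss = losses[c][a]
    # Importance-weighted loss estimates, as in EXP3, fed to both layers.
    per_class[c].update(a, loss / max(p_a, 1e-12))
    top.update(c, loss / max(p_c, 1e-12))
    return classes[c][a], loss


# Example: 3 classes of 4 similar alternatives each (12 arms total).
classes = [[f"arm_{c}_{a}" for a in range(4)] for c in range(3)]
top = ExpWeights(len(classes), eta=0.1)
per_class = [ExpWeights(4, eta=0.1) for _ in classes]
losses = [[random.random() for _ in cls] for cls in classes]
print(nested_play(classes, top, per_class, losses))
```

The point of the two-layer draw is that exploration effort is spread over classes of similar alternatives rather than over every alternative individually, which is the intuition behind avoiding the excessive exploration that flat exponential-weights methods incur in this setting.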