The multi-armed bandit (MAB) model is one of the most classical models for studying decision-making in an uncertain environment. In this model, a player chooses one of the K arms of a bandit machine to play at each time step, and the chosen arm returns a random reward drawn from an unknown distribution specific to that arm. The goal of the player is to collect as much reward as possible during the process. Despite its simplicity, the MAB model offers an excellent playground for studying the trade-off between exploration and exploitation and for designing effective algorithms for sequential decision-making under uncertainty. Although many asymptotically optimal algorithms have been established, the finite-time behaviour of the stochastic dynamics of the MAB model appears much more difficult to analyze, owing to the intertwining of decision-making and reward collection. In this paper, we employ techniques from statistical physics to analyze the MAB model, which facilitates the characterization of the distribution of the cumulative regret at a finite, short time horizon, the central quantity of interest for an MAB algorithm, as well as the intricate dynamical behaviours of the model.
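For reference, the cumulative (pseudo-)regret referred to above is conventionally defined as the gap between the reward accumulated by always playing the best arm and the expected reward actually collected; the notation below (horizon $T$, mean reward $\mu_k$ of arm $k$, arm $a_t$ played at step $t$) is the standard one and not necessarily that of this paper:
\[
R_T \;=\; T\,\mu^{*} \;-\; \sum_{t=1}^{T} \mu_{a_t},
\qquad \mu^{*} \;=\; \max_{1 \le k \le K} \mu_k .
\]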