We adopt Bayesian nonparametric mixture models to extend multi-armed bandits in general, and Thompson sampling in particular, to scenarios with reward model uncertainty. In the stochastic multi-armed bandit, the reward for the played arm is generated from an unknown distribution. Reward uncertainty, i.e., the lack of knowledge about the reward-generating distribution, induces the exploration-exploitation trade-off: a bandit agent must simultaneously learn the properties of the reward distribution and sequentially decide which action to take next. In this work, we extend Thompson sampling to such reward model uncertainty by adopting Bayesian nonparametric Gaussian mixture models for flexible reward density estimation. The proposed Bayesian nonparametric mixture model Thompson sampling sequentially learns the reward model that best approximates the true, yet unknown, per-arm reward distribution, achieving successful regret performance. Based on a novel posterior-convergence analysis, we derive an asymptotic regret bound for the proposed method. In addition, we empirically evaluate its performance in diverse and previously elusive bandit environments, e.g., with rewards outside the exponential family, subject to outliers, and with different per-arm reward distributions. We show that the proposed Bayesian nonparametric Thompson sampling outperforms state-of-the-art alternatives, both in average cumulative regret and in regret volatility. The proposed method is valuable in the presence of bandit reward model uncertainty, as it avoids stringent case-by-case model design choices, yet provides important regret savings.
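To make the idea concrete, below is a minimal, illustrative sketch of Thompson sampling with per-arm Bayesian Gaussian mixture reward models. It is not the paper's implementation: the three-armed environment, hyperparameters, and the approximate posterior-sampling step via scikit-learn's variational BayesianGaussianMixture are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, not the paper's code): Thompson sampling where each arm's
# reward density is modeled by a variational Bayesian Gaussian mixture.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

def sample_expected_reward(rewards, n_components=5):
    """Draw one (approximate) posterior sample of an arm's expected reward.

    Fits a variational Bayesian Gaussian mixture to the observed rewards, then
    samples mixture weights and component means from the variational posterior;
    the sampled expected reward is sum_k w_k * mu_k.
    """
    x = np.asarray(rewards).reshape(-1, 1)
    if len(x) < n_components:  # too few observations to fit the mixture yet
        return rng.normal(np.mean(x), 1.0 / np.sqrt(len(x)))
    gmm = BayesianGaussianMixture(
        n_components=n_components, covariance_type="diag",
        weight_concentration_prior_type="dirichlet_distribution",
        max_iter=200, random_state=0).fit(x)
    w = rng.dirichlet(gmm.weight_concentration_)            # sample mixture weights
    mu_var = gmm.covariances_[:, 0] / gmm.mean_precision_   # posterior variance of component means
    mu = rng.normal(gmm.means_[:, 0], np.sqrt(mu_var))      # sample component means
    return float(w @ mu)

# Hypothetical 3-armed bandit with non-Gaussian per-arm rewards (assumed for the demo).
def pull(arm):
    if arm == 0:  # bimodal rewards
        return rng.normal(0.0, 1.0) if rng.random() < 0.5 else rng.normal(3.0, 0.5)
    if arm == 1:  # Gaussian rewards
        return rng.normal(1.0, 2.0)
    return rng.standard_t(df=3) + 1.2  # heavy-tailed rewards

n_arms, horizon = 3, 500
history = [[] for _ in range(n_arms)]
for a in range(n_arms):                  # one initial pull per arm
    history[a].append(pull(a))
for t in range(horizon):
    samples = [sample_expected_reward(history[a]) for a in range(n_arms)]
    arm = int(np.argmax(samples))        # Thompson step: act greedily on the posterior sample
    history[arm].append(pull(arm))

print([len(h) for h in history])         # pull counts per arm
```

The design choice illustrated here is the one emphasized in the abstract: the same agent is applied to bimodal, Gaussian, and heavy-tailed arms without case-by-case reward model design, because the mixture model adapts its density estimate to each arm's observed rewards.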