We adopt Bayesian nonparametric mixture models to extend multi-armed bandits in general, and Thompson sampling in particular, to scenarios with reward model uncertainty. In the stochastic multi-armed bandit, an agent must learn a policy that maximizes long-term payoff while the reward of each selected action is generated from an unknown distribution. Thompson sampling is a generative and interpretable multi-armed bandit algorithm that has been shown both to perform well in practice and to enjoy optimality properties for certain reward functions. Nevertheless, Thompson sampling requires knowledge of the true reward model, both to compute expected rewards and to sample from its parameter posterior. In this work, we extend Thompson sampling to complex scenarios with model uncertainty by adopting a very flexible family of reward distributions: Bayesian nonparametric Gaussian mixture models. The generative process of Bayesian nonparametric mixtures naturally aligns with the Bayesian modeling of multi-armed bandits: the nonparametric model autonomously adjusts its complexity as new rewards are observed for the played arms. By characterizing each arm's reward distribution with an independent nonparametric mixture model, the proposed method sequentially learns the model that best approximates the true underlying reward distribution, achieving successful performance in complex bandits whose rewards do not belong to the exponential family. Our contribution is valuable for practical scenarios, as it avoids stringent case-by-case model specification and hyperparameter tuning, yet attains reduced regret in diverse bandit settings.
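To make the approach concrete, the following is a minimal, purely illustrative sketch (not the authors' implementation) of Thompson sampling where each arm's reward distribution is modeled by an independent Dirichlet process Gaussian mixture. All hyperparameters (a Normal-Inverse-Gamma base measure, the concentration ALPHA_DP, the number of Gibbs sweeps N_SWEEPS), the helper names, and the two-arm example are assumptions made for this sketch; in particular, the posterior mixture weights are approximated by a single Dirichlet draw over cluster counts rather than a full stick-breaking construction.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(0)

# Normal-Inverse-Gamma hyperparameters of the base measure (assumed values).
MU0, KAPPA0, A0, B0 = 0.0, 1.0, 2.0, 2.0
ALPHA_DP = 1.0   # DP concentration: controls how easily new components appear
N_SWEEPS = 3     # collapsed Gibbs sweeps per round (kept small for speed)


def nig_posterior(y):
    """Posterior Normal-Inverse-Gamma parameters given observations y."""
    n = len(y)
    if n == 0:
        return MU0, KAPPA0, A0, B0
    ybar = np.mean(y)
    kappa_n = KAPPA0 + n
    mu_n = (KAPPA0 * MU0 + n * ybar) / kappa_n
    a_n = A0 + n / 2.0
    b_n = (B0 + 0.5 * np.sum((y - ybar) ** 2)
           + KAPPA0 * n * (ybar - MU0) ** 2 / (2.0 * kappa_n))
    return mu_n, kappa_n, a_n, b_n


def predictive_density(x, y):
    """Student-t posterior-predictive density of x given cluster members y."""
    mu_n, kappa_n, a_n, b_n = nig_posterior(np.asarray(y, dtype=float))
    scale = np.sqrt(b_n * (kappa_n + 1.0) / (a_n * kappa_n))
    return student_t.pdf(x, df=2.0 * a_n, loc=mu_n, scale=scale)


def gibbs_sweep(rewards, z):
    """One collapsed Gibbs sweep over cluster assignments z (CRP prior)."""
    for i, y_i in enumerate(rewards):
        z[i] = -1                                # remove point i from its cluster
        labels = sorted(set(z) - {-1})
        probs = []
        for k in labels:
            members = [rewards[j] for j in range(len(rewards)) if z[j] == k]
            probs.append(len(members) * predictive_density(y_i, members))
        probs.append(ALPHA_DP * predictive_density(y_i, []))   # open a new cluster
        probs = np.asarray(probs) / np.sum(probs)
        choice = rng.choice(len(probs), p=probs)
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z


def sample_arm_mean(rewards, z):
    """Draw one posterior sample of the arm's expected reward under the mixture."""
    labels = sorted(set(z))
    counts = np.array([z.count(k) for k in labels], dtype=float)
    # Dirichlet over (cluster counts, ALPHA_DP) approximates the posterior weights;
    # the last entry stands in for all components not yet observed.
    weights = rng.dirichlet(np.append(counts, ALPHA_DP))
    means = []
    for k in labels:
        members = np.asarray([rewards[j] for j in range(len(rewards)) if z[j] == k])
        mu_n, kappa_n, a_n, b_n = nig_posterior(members)
        sigma2 = 1.0 / rng.gamma(a_n, 1.0 / b_n)       # Inverse-Gamma draw
        means.append(rng.normal(mu_n, np.sqrt(sigma2 / kappa_n)))
    sigma2_new = 1.0 / rng.gamma(A0, 1.0 / B0)         # fresh component from the base
    means.append(rng.normal(MU0, np.sqrt(sigma2_new / KAPPA0)))
    return float(np.dot(weights, means))


def thompson_sampling(true_arms, horizon=100):
    """Thompson sampling with an independent DP Gaussian mixture per arm."""
    rewards = [[] for _ in true_arms]        # observed rewards per arm
    z = [[] for _ in true_arms]              # cluster label of each observed reward
    for _ in range(horizon):
        sampled = [sample_arm_mean(rewards[a], z[a]) for a in range(len(true_arms))]
        arm = int(np.argmax(sampled))        # play the arm with the best sampled mean
        rewards[arm].append(true_arms[arm]())
        z[arm].append(max(z[arm], default=-1) + 1)     # new reward starts its own cluster
        for _ in range(N_SWEEPS):
            z[arm] = gibbs_sweep(rewards[arm], z[arm])
    return rewards


# Two arms with bimodal (non-exponential-family) reward distributions.
arms = [
    lambda: rng.normal(-1.0, 0.5) if rng.random() < 0.5 else rng.normal(2.0, 0.5),
    lambda: rng.normal(0.0, 0.5) if rng.random() < 0.7 else rng.normal(3.0, 0.5),
]
print([len(h) for h in thompson_sampling(arms)])   # number of pulls per arm
```

Because the collapsed Gibbs sweep can open new clusters as rewards arrive, the per-arm model grows its number of Gaussian components only as the data demand, which illustrates the sense in which the nonparametric model autonomously adjusts its complexity.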