The Asymptotic Randomised Control (ARC) algorithm provides a rigorous approximation to the optimal strategy for a wide class of Bayesian bandits, while retaining reasonable computational complexity. In particular, it allows a decision maker to observe signals in addition to their rewards, to incorporate correlations between the outcomes of different choices, and to have nontrivial dynamics for their estimates. The algorithm is guaranteed to asymptotically optimise the expected discounted payoff, with error depending on the initial uncertainty of the bandit. In this paper, we consider a batched bandit problem where observations arrive from a generalised linear model; we extend the ARC algorithm to this setting. We apply this to a classic dynamic pricing problem based on a Bayesian hierarchical model and demonstrate that the ARC algorithm outperforms alternative approaches.