PAC-Bayes has recently re-emerged as an effective theory with which one can derive principled learning algorithms with tight performance guarantees. However, applications of PAC-Bayes to bandit problems are relatively rare, which is unfortunate. Many decision-making problems in healthcare, finance and the natural sciences can be modelled as bandit problems, and in many of these applications principled algorithms with strong performance guarantees would be highly valuable. This survey provides an overview of PAC-Bayes performance bounds for bandit problems and an experimental comparison of these bounds. Our experimental comparison reveals that the available PAC-Bayes upper bounds on the cumulative regret are loose, whereas the available PAC-Bayes lower bounds on the expected reward can be surprisingly tight. We found that an offline contextual bandit algorithm that learns a policy by optimising a PAC-Bayes bound was able to learn randomised neural network policies with competitive expected reward and non-vacuous performance guarantees.