We consider online sequential decision problems in which an agent must balance exploration and exploitation. We derive a set of Bayesian `optimistic' policies which, in the stochastic multi-armed bandit case, includes the Thompson sampling policy. We provide a new analysis showing that any algorithm that produces a policy in the optimistic set at every round enjoys $\tilde O(\sqrt{AT})$ Bayesian regret on a problem with $A$ actions after $T$ rounds. We extend the regret analysis for optimistic policies to bilinear saddle-point problems, which include zero-sum matrix games and constrained bandits as special cases. In this setting we show that Thompson sampling can produce policies outside of the optimistic set and suffer linear regret in some instances. Finding a policy inside the optimistic set amounts to solving a convex optimization problem, and we call the resulting algorithm `variational Bayesian optimistic sampling' (VBOS). The procedure works for any posterior, \ie, it does not require the posterior to have any special properties, such as log-concavity, unimodality, or smoothness. The variational view of the problem has many useful properties, including the ability to tune the exploration-exploitation tradeoff, add regularization, incorporate constraints, and linearly parameterize the policy.
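To make the convex problem behind VBOS concrete, the display below sketches one instantiation for the multi-armed bandit case; it rests on the standard cumulant-generating-function bound on the expected maximum, with $\mu_a$ the unknown mean of action $a$ and $G_a(\beta) = \log \mathbb{E}\, e^{\beta \mu_a}$ the posterior cumulant generating function (this notation is ours, introduced for illustration, and may differ from the paper's exact formulation):
\[
\mathbb{E}\Big[\max_{a}\mu_a\Big] \;\le\; \min_{\tau > 0}\; \tau \log \sum_{a=1}^{A} \exp\big(G_a(1/\tau)\big),
\]
where the right-hand side is convex in $\tau$ (it has the form $\tau\, g(1/\tau)$ for convex $g$, a perspective function), so the minimization is a one-dimensional convex problem; a natural optimistic policy is then the softmax $\pi_a \propto \exp\big(G_a(1/\tau^\star)\big)$ at the minimizing temperature $\tau^\star$.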
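As a minimal runnable sketch of the bound above, the snippet below assumes Gaussian posteriors; the function name \texttt{vbos\_policy}, the Gaussian choice, and the softmax policy extraction are ours for illustration and should not be read as the paper's exact algorithm.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: solve min_{tau > 0} tau * log sum_a exp(G_a(1/tau)) for
# Gaussian posteriors mu_a ~ N(m[a], s2[a]), whose cumulant generating
# function is G_a(beta) = m[a]*beta + s2[a]*beta**2 / 2, then return
# the softmax policy at the minimizing temperature.
def vbos_policy(m, s2):
    m, s2 = np.asarray(m, float), np.asarray(s2, float)

    def objective(log_tau):
        tau = np.exp(log_tau)          # optimize log(tau) so tau > 0
        z = m / tau + s2 / (2.0 * tau ** 2)
        zmax = z.max()                 # stabilized log-sum-exp
        return tau * (zmax + np.log(np.exp(z - zmax).sum()))

    tau = np.exp(minimize_scalar(objective).x)   # 1-D scalar problem
    z = m / tau + s2 / (2.0 * tau ** 2)
    z -= z.max()
    pi = np.exp(z)
    return pi / pi.sum()

# Equal posterior means, increasing posterior variance: the policy
# places more mass on the more uncertain arms (optimism).
print(vbos_policy(m=[0.5, 0.5, 0.5], s2=[0.01, 0.04, 0.25]))
\end{verbatim}

The objective is convex in $\tau$ but only quasiconvex in $\log\tau$; since it is unimodal with a unique interior minimum, the scalar minimizer still locates it.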