In many biomedical, scientific, and engineering problems, one must sequentially decide which action to take next so as to maximize rewards. One general class of algorithms for optimizing interactions with the world, while simultaneously learning how it operates, is the multi-armed bandit setting and, in particular, its contextual variant. In this setting, for each executed action one observes a reward that depends on a 'context' available at each interaction with the world. The Thompson sampling algorithm has recently been shown to enjoy provable optimality properties for this class of problems and to perform well in real-world settings, and it facilitates generative and interpretable modeling of the problem at hand. Nevertheless, the design and complexity of the model limit its applicability, since one must be able both to sample from the modeled distributions and to compute their expected rewards. We show here how these limitations can be overcome using variational inference to approximate complex models, bringing to the reinforcement learning setting advances developed for approximate inference in the machine learning community over the past two decades. We consider contextual multi-armed bandit applications where the true reward distribution is unknown and complex, and approximate it with a mixture model whose parameters are inferred via variational inference. We show that the proposed variational Thompson sampling approach accurately approximates the true reward distribution and attains reduced regret even with complex reward distributions. The proposed algorithm is valuable for practical scenarios where restrictive modeling assumptions are undesirable.
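To make the Thompson sampling loop referred to above concrete, the following is a minimal sketch of the classic (non-variational) algorithm on a Beta-Bernoulli bandit, where the posterior over each arm's reward probability is available in closed form. The arm reward probabilities are hypothetical, chosen only for illustration; the paper's contribution replaces this conjugate posterior with a variational mixture-model approximation for complex reward distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.2, 0.5, 0.8]  # hypothetical Bernoulli reward probabilities per arm
n_arms = len(true_probs)

# Beta(alpha, beta) posterior parameters per arm, starting from a uniform prior.
alpha = np.ones(n_arms)
beta = np.ones(n_arms)
pulls = np.zeros(n_arms, dtype=int)

for t in range(2000):
    # Thompson sampling: draw one sample from each arm's posterior,
    # then play the arm with the highest sampled mean reward.
    theta = rng.beta(alpha, beta)
    arm = int(np.argmax(theta))
    reward = rng.binomial(1, true_probs[arm])
    # Conjugate Beta-Bernoulli posterior update for the played arm.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

best_arm = int(np.argmax(pulls))
```

As exploration resolves, play concentrates on the arm with the highest true reward probability; the variational approach described in the abstract follows the same sample-then-act loop, but draws the posterior samples from an approximate mixture posterior fitted by variational inference rather than from a conjugate closed form.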