In this work, we initiate the idea of using denoising diffusion models to learn priors for online decision-making problems. Our focus is on the meta-learning for bandits framework, with the goal of learning a strategy that performs well across bandit tasks of the same class. To this end, we train a diffusion model that learns the underlying task distribution and combine Thompson sampling with the learned prior to deal with new tasks at test time. Our posterior sampling algorithm carefully balances the learned prior against the noisy observations that come from the learner's interaction with the environment. To capture realistic bandit scenarios, we also propose a novel diffusion model training procedure that can train even from incomplete and/or noisy data, which could be of independent interest. Finally, our extensive experimental evaluations clearly demonstrate the potential of the proposed approach.
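To make the approach concrete, the following is a minimal, hypothetical Python sketch of Thompson sampling with a diffusion prior, not the paper's implementation. The `DiffusionPrior` class, its score function, and all hyperparameters are illustrative assumptions: the posterior sample is drawn via likelihood-guided Langevin dynamics under an isotropic Gaussian stand-in for a trained denoiser, which a real system would replace with the score network learned from past bandit tasks.

```python
# Hedged sketch: Thompson sampling with a diffusion-model prior over
# bandit task parameters. All names and constants here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

class DiffusionPrior:
    """Stand-in for a diffusion model trained on arm-mean vectors theta.

    The 'learned' prior is just N(0, I) so the sketch runs end to end;
    a trained denoising network would supply the real score function.
    """

    def __init__(self, dim, n_steps=50):
        self.dim = dim
        self.n_steps = n_steps

    def score(self, theta, t):
        # Score of N(0, I) is -theta; a trained model replaces this
        # (and would actually use the diffusion time t).
        return -theta

    def posterior_sample(self, obs, noise_var):
        """Draw a posterior sample by likelihood-guided Langevin dynamics.

        obs: dict arm -> list of observed rewards. The likelihood gradient
        pulls the sample toward the data; the prior score keeps it on the
        learned task distribution (here: the Gaussian stand-in).
        """
        theta = rng.normal(size=self.dim)
        step = 1.0 / self.n_steps
        for k in range(self.n_steps):
            t = 1.0 - k * step                  # nominal diffusion time
            grad = self.score(theta, t).copy()  # prior score term
            # Guidance term: gradient of sum_i log N(r_i; theta_a, noise_var)
            for arm, rewards in obs.items():
                for r in rewards:
                    grad[arm] += (r - theta[arm]) / noise_var
            # Unadjusted Langevin update targeting prior x likelihood
            theta = (theta + 0.5 * step * grad
                     + np.sqrt(step) * rng.normal(size=self.dim))
        return theta

# Thompson sampling loop on a toy Gaussian bandit.
n_arms, horizon, noise_var = 5, 200, 0.25
true_means = rng.normal(size=n_arms)        # unknown task parameters
prior = DiffusionPrior(dim=n_arms)
obs = {a: [] for a in range(n_arms)}

for t in range(horizon):
    theta = prior.posterior_sample(obs, noise_var)  # posterior sample
    arm = int(np.argmax(theta))                     # act greedily on it
    reward = true_means[arm] + np.sqrt(noise_var) * rng.normal()
    obs[arm].append(reward)

print("best arm:", int(np.argmax(true_means)),
      "most pulled:", max(obs, key=lambda a: len(obs[a])))
```

As the observation counts grow, the likelihood term dominates the prior score, so the sampled theta concentrates on the data; with few observations, the sample stays close to the learned task distribution, which is the prior/data balance the abstract describes.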