We study Thompson sampling (TS) in online decision-making problems where the uncertain environment is sampled from a mixture distribution. This is relevant to multi-task settings, where a learning agent is faced with different classes of problems. We incorporate this structure in a natural way by initializing TS with a mixture prior -- dubbed MixTS -- and develop a novel, general technique for analyzing the regret of TS with such priors. We apply this technique to derive Bayes regret bounds for MixTS in both linear bandits and tabular Markov decision processes (MDPs). Our regret bounds reflect the structure of the problem and depend on the number of components and confidence width of each component of the prior. Finally, we demonstrate the empirical effectiveness of MixTS in both synthetic and real-world experiments.
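The abstract describes initializing Thompson sampling with a mixture prior. As a minimal illustration (not the paper's algorithm), the sketch below runs TS on a hypothetical 1-D linear bandit with actions in {-1, +1}, where the unknown parameter is drawn from a two-component Gaussian mixture. All constants (noise level, component means and variances, horizon) are assumptions chosen for the example; the posterior stays a Gaussian mixture, so sampling means picking a component by its posterior weight and then sampling from that component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: reward r = a * theta + noise, actions a in {-1, +1}.
# theta is drawn from a 2-component Gaussian mixture prior.
sigma = 0.5                          # known reward-noise std
weights = np.array([0.5, 0.5])       # prior mixture weights
mus = np.array([-1.0, 1.0])          # component means
taus2 = np.array([0.25, 0.25])       # component variances

# Sample the true parameter from the mixture.
theta = rng.choice(mus) + np.sqrt(taus2[0]) * rng.standard_normal()

log_w = np.log(weights)              # log posterior weights (unnormalized)
post_mu, post_var = mus.copy(), taus2.copy()

for t in range(200):
    # MixTS-style step: sample a component, then a parameter from it.
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    k = rng.choice(2, p=w)
    theta_s = post_mu[k] + np.sqrt(post_var[k]) * rng.standard_normal()

    a = 1.0 if theta_s >= 0 else -1.0          # act greedily on the sample
    r = a * theta + sigma * rng.standard_normal()

    # Since a is +/-1, y = r / a is a noisy observation of theta.
    y = r / a
    for j in range(2):
        pred_var = post_var[j] + sigma**2      # predictive variance of y
        # Bayes update of the component weight via the marginal likelihood.
        log_w[j] += -0.5 * np.log(2 * np.pi * pred_var) \
                    - 0.5 * (y - post_mu[j])**2 / pred_var
        # Conjugate Gaussian update of the component posterior.
        gain = post_var[j] / pred_var
        post_mu[j] += gain * (y - post_mu[j])
        post_var[j] *= sigma**2 / pred_var
```

After enough rounds, the weight of the component that generated `theta` dominates and its posterior mean concentrates near the true parameter, which is what lets regret depend on the number of components and their widths rather than on the full prior support.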