In the latent bandit problem, the learner has access to reward distributions and -- for the non-stationary variant -- transition models of the environment. The reward distributions are conditioned on the arm and unknown latent states. The goal is to use the reward history to identify the latent state, allowing for the optimal choice of arms in the future. The latent bandit setting lends itself to many practical applications, such as recommender and decision-support systems, where rich data allows offline estimation of environment models while online learning remains a critical component. Previous solutions in this setting always choose the arm with the highest expected reward under the agent's current belief about the state, without explicitly considering the value of information-gathering arms. Such information-gathering arms do not necessarily provide the highest reward, and may therefore never be chosen by an agent that always selects the highest-reward arm. In this paper, we present a method for information gathering in latent bandits. We show that, for particular reward structures and transition matrices, always choosing the best arm under the agent's beliefs about the states incurs higher regret. Furthermore, we show that by choosing arms carefully, we obtain an improved estimate of the state distribution, and thereby lower the cumulative regret through better arm choices in the future. We evaluate our method on both synthetic and real-world data sets, showing significant improvement in regret over state-of-the-art methods.
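To make the setting concrete, the following is a minimal sketch (not the paper's algorithm) of a static latent bandit with Gaussian rewards: a belief over latent states is updated from observed rewards by Bayes' rule, a greedy policy picks the arm with the highest expected reward under that belief, and an illustrative information-gathering criterion (expected reduction in posterior entropy) is shown for contrast. All names (mu, sigma, belief, info_arm) and the entropy-based criterion are assumptions made for illustration only.

```python
# Minimal latent bandit sketch (illustrative only, not the paper's method).
# Assumptions: a single static latent state, Gaussian rewards with known noise
# sigma, and a known reward-mean table mu[state, arm] estimated offline.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n_states, n_arms, sigma = 3, 4, 0.5
mu = rng.uniform(0.0, 1.0, size=(n_states, n_arms))  # known reward means
true_state = 1
belief = np.full(n_states, 1.0 / n_states)            # uniform prior over latent states

def update_belief(belief, arm, reward):
    """Bayesian update of the latent-state posterior after observing a reward."""
    likelihood = norm.pdf(reward, loc=mu[:, arm], scale=sigma)
    posterior = belief * likelihood
    return posterior / posterior.sum()

def greedy_arm(belief):
    """Arm with the highest expected reward under the current belief."""
    return int(np.argmax(belief @ mu))

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def info_arm(belief, n_samples=200):
    """Arm that most reduces expected posterior entropy -- one possible
    information-gathering criterion, used here purely for illustration."""
    gains = []
    for a in range(n_arms):
        post_entropies = []
        for _ in range(n_samples):
            s = rng.choice(n_states, p=belief)          # sample a plausible state
            r = rng.normal(mu[s, a], sigma)             # simulate its reward
            post_entropies.append(entropy(update_belief(belief, a, r)))
        gains.append(entropy(belief) - np.mean(post_entropies))
    return int(np.argmax(gains))

# Toy interaction loop: alternate information-gathering and greedy choices.
for t in range(20):
    arm = greedy_arm(belief) if t % 2 else info_arm(belief)
    reward = rng.normal(mu[true_state, arm], sigma)
    belief = update_belief(belief, arm, reward)

print("posterior over latent states:", np.round(belief, 3))
```

An arm can have low expected reward yet separate the latent states sharply (its reward means differ a lot across states), which is exactly the case where a purely greedy policy never gathers the information needed to identify the state.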