This paper explores multi-armed bandit (MAB) strategies in very short horizon scenarios, i.e., when the bandit strategy is only allowed very few interactions with the environment. This is an understudied setting in the MAB literature with many applications in the context of games, such as player modeling. Specifically, we pursue three different ideas. First, we explore the use of regression oracles, which replace the simple average used in strategies such as epsilon-greedy with linear regression models. Second, we examine different exploration patterns such as forced exploration phases. Finally, we introduce a new variant of the UCB1 strategy called UCBT that has interesting properties and no tunable parameters. We present experimental results in a domain motivated by exergames, where the goal is to maximize a player's daily steps. Our results show that the combination of epsilon-greedy or epsilon-decreasing with regression oracles outperforms all other tested strategies in the short horizon setting.
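To make the regression-oracle idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of epsilon-greedy where each arm's value estimate comes from a least-squares linear fit of reward against round index rather than the running mean. The arm semantics, reward model, and all function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(rounds, rewards, t):
    """Estimate an arm's reward at round t via a linear regression oracle;
    fall back to the sample mean when fewer than two observations exist."""
    if len(rewards) < 2:
        return float(np.mean(rewards)) if rewards else 0.0
    slope, intercept = np.polyfit(rounds, rewards, 1)
    return slope * t + intercept

def epsilon_greedy_regression(pull, n_arms=3, horizon=10, epsilon=0.2):
    """Epsilon-greedy over a very short horizon, exploiting the regression
    oracle's per-arm estimates instead of simple averages."""
    history = [([], []) for _ in range(n_arms)]  # (rounds, rewards) per arm
    for t in range(horizon):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))              # explore uniformly
        else:
            estimates = [predict(r, w, t) for r, w in history]
            arm = int(np.argmax(estimates))              # exploit the oracle
        reward = pull(arm, t)
        history[arm][0].append(t)
        history[arm][1].append(reward)
    return history

# Toy environment (hypothetical): arm = intervention level, reward = daily steps.
history = epsilon_greedy_regression(
    pull=lambda arm, t: 5000 + 1500 * arm + 200 * t + rng.normal(0, 300))
```

With only a handful of rounds available, the fitted trend lets the strategy extrapolate each arm's value from very few samples, which is the motivation for pairing regression oracles with epsilon-greedy or epsilon-decreasing in the short-horizon setting.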