On typical modern platforms, users are only able to try a small fraction of the available items. This makes it difficult to model the exploration behavior of platform users as that of typical online learners, who explore all the items. To address this issue, we propose to interpret a recommender system as a bandit exploration coordinator that provides counterfactual information updates. In particular, we introduce a novel algorithm called Counterfactual UCB (CFUCB), which guarantees coordinated user exploration with bounded regret in the presence of linear representations. Our results show that sharing information is a Subgame Perfect Nash Equilibrium for agents in terms of regret, leading to each agent achieving bounded regret. This approach has potential applications in personalized recommender systems and adaptive experimentation.
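To make the coordination idea concrete, here is a minimal sketch (not the paper's exact CFUCB) of two linear-bandit agents that broadcast their observations to each other, so each agent can update its estimates for items it never pulled itself. The linear reward model, feature dimension, arm set, and exploration parameter are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, horizon = 5, 20, 400
arms = rng.normal(size=(n_arms, d))   # arm feature vectors (assumed known)
theta_star = rng.normal(size=d)       # unknown linear reward parameter

def reward(arm_idx):
    """Noisy linear reward for pulling the given arm."""
    return arms[arm_idx] @ theta_star + 0.1 * rng.normal()

class SharingLinUCB:
    """LinUCB-style agent that can absorb other agents' (arm, reward) pairs."""
    def __init__(self, alpha=1.0):
        self.A = np.eye(d)      # regularized Gram matrix
        self.b = np.zeros(d)    # feature-weighted reward sum
        self.alpha = alpha      # exploration bonus scale

    def update(self, x, r):
        # Works identically for the agent's own pulls and for shared
        # observations received from other agents.
        self.A += np.outer(x, x)
        self.b += r * x

    def choose(self):
        theta_hat = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        # UCB score: estimated reward plus confidence-width bonus per arm.
        ucb = arms @ theta_hat + self.alpha * np.sqrt(
            np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        return int(np.argmax(ucb))

agents = [SharingLinUCB(), SharingLinUCB()]
regret = [0.0, 0.0]
best = max(arms @ theta_star)
for t in range(horizon):
    for i, agent in enumerate(agents):
        a = agent.choose()
        r = reward(a)
        regret[i] += best - arms[a] @ theta_star
        for other in agents:          # broadcast the observation:
            other.update(arms[a], r)  # every agent learns from every pull

print(regret)
```

Because both agents ingest every observation, each one's Gram matrix grows twice as fast as it would in isolation, which is the intuition behind information sharing reducing per-agent regret.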