Users of recommender systems often behave in a non-stationary fashion, due to their evolving preferences and tastes over time. In this work, we propose a practical approach for fast personalization to non-stationary users. The key idea is to frame this problem as a latent bandit, where the prototypical models of user behavior are learned offline and the latent state of the user is inferred online from their interactions with the models. We call this problem a non-stationary latent bandit. We propose Thompson sampling algorithms for regret minimization in non-stationary latent bandits, analyze them, and evaluate them on a real-world dataset. The main strength of our approach is that it can be combined with rich offline-learned models, which can be misspecified and are subsequently fine-tuned online using posterior sampling. In this way, we naturally combine the strengths of offline and online learning.
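To make the latent-bandit idea concrete, the following is a minimal, illustrative sketch of Thompson sampling over a latent user state: an offline-learned reward model per state is fixed, a latent state is sampled from the posterior at each round, the best arm under that state's model is played, and the posterior is updated from the observed reward. All names (`mu_offline`, `pull`, the Bernoulli reward assumption, and the stationary latent state) are hypothetical choices for illustration; the paper's algorithms additionally handle non-stationarity via a model of latent-state transitions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (for illustration only): K arms, S latent user states,
# and an offline-learned model mu_offline[s, a] = mean reward of arm a in state s.
K, S = 5, 3
mu_offline = rng.uniform(0.1, 0.9, size=(S, K))  # stand-in for offline-learned models
true_state = 1                                   # unknown latent state of the simulated user

belief = np.full(S, 1.0 / S)  # posterior over the latent state

def pull(arm):
    """Simulated user feedback: Bernoulli reward under the true latent state."""
    return rng.binomial(1, mu_offline[true_state, arm])

T = 500
for t in range(T):
    # Thompson sampling: sample a latent state from the posterior,
    # then act greedily with respect to the offline model of that state.
    s = rng.choice(S, p=belief)
    arm = int(np.argmax(mu_offline[s]))
    r = pull(arm)

    # Bayesian update of the belief over latent states given the observed reward.
    likelihood = np.where(r == 1, mu_offline[:, arm], 1.0 - mu_offline[:, arm])
    belief = belief * likelihood
    belief /= belief.sum()

print("posterior over latent states:", np.round(belief, 3))
```

In the non-stationary setting, the update step would also propagate the belief through an (offline-learned, possibly misspecified) transition model before each round, so that the posterior can track changes in the user's latent state.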