Contextual multi-armed bandits provide powerful tools for solving the exploitation-exploration dilemma in decision making, with direct applications in personalized recommendation. In fact, collaborative effects among users carry significant potential to improve recommendation quality. In this paper, we introduce and study the problem of `Neural Collaborative Filtering Bandits', where the rewards can be non-linear functions and groups are formed dynamically given different specific contents. To solve this problem, inspired by meta-learning, we propose Meta-Ban (meta-bandits), in which a meta-learner is designed to represent and rapidly adapt to dynamic groups, together with a UCB-based exploration strategy. Furthermore, we prove that Meta-Ban achieves a regret bound of $\mathcal{O}(\sqrt{T \log T})$, improving by a multiplicative factor of $\sqrt{\log T}$ over state-of-the-art related works. Finally, we conduct extensive experiments demonstrating that Meta-Ban significantly outperforms six strong baselines.