In today's technology environment, information is abundant, dynamic, and heterogeneous. Automated filtering and prioritization of information depend on distinguishing whether a piece of information adds substantial value toward one's goal. Contextual multi-armed bandits have been widely used to learn to filter content and prioritize it according to user interest or relevance. Learning-to-rank techniques optimize the relevance ranking of items, allowing content to be selected accordingly. We propose a novel approach to top-K ranking under the contextual multi-armed bandit framework. We model the stochastic reward function with a neural network, whose non-linear approximation lets it learn the relationship between rewards and contexts. We demonstrate the approach and evaluate its learning performance in experiments that use real-world data sets in simulated scenarios. Empirical results show that the approach performs well under a complex reward structure and high-dimensional contextual features.
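To make the setup concrete, the following is a minimal sketch of top-K selection in a contextual bandit with a neural reward model. It is not the method evaluated in the paper: the abstract does not specify the network architecture, the exploration strategy, or the training procedure, so the one-hidden-layer network, the epsilon-greedy exploration, the synthetic reward, and all names (`NeuralRewardModel`, `select_top_k`, `simulate`) are assumptions made for illustration.

```python
import numpy as np


class NeuralRewardModel:
    """One-hidden-layer MLP mapping a context/item feature vector to a predicted reward.

    Architecture and hyperparameters are illustrative assumptions.
    """

    def __init__(self, dim, hidden=16, lr=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, X):
        """Predicted reward for each row of X (shape: n_items x dim)."""
        H = np.tanh(X @ self.W1 + self.b1)
        return H @ self.w2 + self.b2

    def update(self, X, r):
        """One gradient step on half the mean squared error for observed rewards r."""
        H = np.tanh(X @ self.W1 + self.b1)
        err = (H @ self.w2 + self.b2) - r            # prediction error, shape (n,)
        n = len(r)
        dH = np.outer(err, self.w2) * (1.0 - H ** 2)  # backprop through tanh
        self.W1 -= self.lr * (X.T @ dH) / n
        self.b1 -= self.lr * dH.mean(axis=0)
        self.w2 -= self.lr * (H.T @ err) / n
        self.b2 -= self.lr * err.mean()


def select_top_k(model, X, k, epsilon, rng):
    """Epsilon-greedy top-K: explore with a random K-subset, else rank by predicted reward."""
    if rng.random() < epsilon:
        return rng.choice(len(X), size=k, replace=False)
    scores = model.predict(X)
    return np.argsort(-scores)[:k]                   # indices of the K highest scores


def simulate(rounds=200, n_items=20, dim=5, k=3, epsilon=0.1, seed=0):
    """Toy bandit loop on synthetic data; returns the mean reward per shown item."""
    rng = np.random.default_rng(seed)
    model = NeuralRewardModel(dim, seed=seed)
    total = 0.0
    for _ in range(rounds):
        X = rng.normal(size=(n_items, dim))          # features of this round's candidates
        chosen = select_top_k(model, X, k, epsilon, rng)
        # Hypothetical non-linear reward with noise, observed only for the shown items.
        r = np.sin(X[chosen, 0]) + 0.1 * rng.normal(size=k)
        model.update(X[chosen], r)
        total += r.sum()
    return total / (rounds * k)
```

Calling `simulate()` runs the interaction loop end to end. Rewards are observed only for the K items that were shown, which is the bandit feedback that makes exploration necessary. Epsilon-greedy is used here only because it is the simplest option to read; other exploration schemes would slot into `select_top_k` without changing the rest of the loop.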