Personalised interactive systems, such as recommender systems, require selecting relevant items depending on context. Production systems need to identify these items rapidly from very large catalogues, a task that can be solved efficiently with maximum inner product search technology. Offline optimisation of maximum inner product search can be achieved by relaxing the discrete problem, resulting in policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the evaluation of the gradient (and hence each stochastic gradient descent iteration) linear in the catalogue size. This calculation is untenable in many real-world settings, such as large-catalogue recommender systems, severely limiting the usefulness of the method in practice. In this paper we show how to produce an excellent approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based on combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
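To make the complexity argument concrete, the following is a minimal sketch (not the paper's implementation) of a softmax policy over a large catalogue and two ways of computing the gradient of log pi(a|u): the exact version, whose expectation term sums over the whole catalogue, and a self-normalised importance sampling estimate computed from a small set of candidate items. The names (catalogue_size, embed_dim, n_samples) are illustrative, and the uniform proposal stands in for candidates that, in the paper's setting, would come from a fast maximum inner product search index.

```python
import numpy as np

# Illustrative sketch: softmax policy pi(b|u) ∝ exp(<x_u, v_b>) over a catalogue.
# The REINFORCE-style gradient of log pi(a|u) w.r.t. the user vector is
# v_a - E_{b~pi}[v_b]; the expectation is the O(catalogue) bottleneck.

rng = np.random.default_rng(0)
catalogue_size, embed_dim = 200_000, 16          # hypothetical sizes
item_vectors = rng.normal(size=(catalogue_size, embed_dim)) / np.sqrt(embed_dim)
user_vector = rng.normal(size=embed_dim)

scores = item_vectors @ user_vector              # s_b = <x_u, v_b>
action = int(np.argmax(scores))                  # some chosen/logged item

# Exact gradient: full softmax over the catalogue (linear cost per step).
probs = np.exp(scores - scores.max())
probs /= probs.sum()
grad_exact = item_vectors[action] - probs @ item_vectors

# Self-normalised importance sampling estimate of E_pi[v_b] from a small
# candidate set drawn from a uniform proposal q(b) = 1/catalogue_size.
n_samples = 2_000
cand = rng.integers(0, catalogue_size, size=n_samples)
log_w = scores[cand]                             # log(exp(s_b)/q) = s_b + const
w = np.exp(log_w - log_w.max())
w /= w.sum()                                     # self-normalisation
grad_snis = item_vectors[action] - w @ item_vectors[cand]

cos = grad_exact @ grad_snis / (np.linalg.norm(grad_exact) * np.linalg.norm(grad_snis))
print(f"cosine similarity of exact vs SNIS gradient: {cos:.4f}")
```

The self-normalisation step means the proposal probabilities only need to be known up to a constant, which is what makes replacing the uniform proposal with candidates retrieved by an approximate maximum inner product search index plausible; the exact retrieval-based estimator used in the paper is not reproduced here.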