Personalised interactive systems such as recommender systems require selecting relevant items dependent on context. Production systems must retrieve these items rapidly from very large catalogues, a task that can be solved efficiently with maximum inner product search technology. Offline optimisation of such systems can be achieved by relaxing the discrete selection problem, which yields policy learning or REINFORCE-style learning algorithms. Unfortunately, this relaxation step requires computing a sum over the entire catalogue, making the complexity of evaluating the gradient (and hence of each stochastic gradient descent iteration) linear in the catalogue size. This computation is untenable in many real-world settings, such as large-catalogue recommender systems, severely limiting the usefulness of the method in practice. In this paper, we derive an excellent approximation of these policy learning algorithms that scales logarithmically with the catalogue size. Our contribution is based upon combining three novel ideas: a new Monte Carlo estimate of the gradient of a policy, the self-normalised importance sampling estimator, and the use of fast maximum inner product search at training time. Extensive experiments show that our algorithm is an order of magnitude faster than naive approaches yet produces equally good policies.
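To make the scaling issue concrete, the following is a minimal illustrative sketch, in our own notation rather than the paper's: assume a softmax policy $\pi_\theta$ over a catalogue of $P$ items with scores $h_\theta(x,a)$, a small sample $S$ drawn from a proposal distribution $q$, and a generic self-normalised importance sampling correction. The exact construction used in the paper may differ.
\[
\pi_\theta(a \mid x) \;=\; \frac{\exp\!\big(h_\theta(x,a)\big)}{\sum_{a'=1}^{P} \exp\!\big(h_\theta(x,a')\big)},
\qquad
\nabla_\theta \log \pi_\theta(a \mid x)
\;=\;
\nabla_\theta h_\theta(x,a)
\;-\;
\underbrace{\sum_{a'=1}^{P} \pi_\theta(a' \mid x)\, \nabla_\theta h_\theta(x,a')}_{\text{exact evaluation costs } O(P)} .
\]
A self-normalised importance sampling estimate replaces the $O(P)$ sum with a sum over the small sample $S$ (which could, for instance, be retrieved by fast maximum inner product search):
\[
\sum_{a'=1}^{P} \pi_\theta(a' \mid x)\, \nabla_\theta h_\theta(x,a')
\;\approx\;
\sum_{s \in S} w_s \, \nabla_\theta h_\theta(x,s),
\qquad
w_s \;=\; \frac{\exp\!\big(h_\theta(x,s)\big)/q(s)}{\sum_{s' \in S} \exp\!\big(h_\theta(x,s')\big)/q(s')} .
\]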