We consider the problem of sequential recommendations, where at each step an agent proposes some slate of $N$ distinct items to a user from a much larger catalog of size $K>>N$. The user has unknown preferences towards the recommendations and the agent takes sequential actions that optimise (in our case minimise) some user-related cost, with the help of Reinforcement Learning. The possible item combinations for a slate is $\binom{K}{N}$, an enormous number rendering value iteration methods intractable. We prove that the slate-MDP can actually be decomposed using just $K$ item-related $Q$ functions per state, which describe the problem in a more compact and efficient way. Based on this, we propose a novel model-free SARSA and Q-learning algorithm that performs $N$ parallel iterations per step, without any prior user knowledge. We call this method \texttt{SlateFree}, i.e. free-of-slates, and we show numerically that it converges very fast to the exact optimum for arbitrary user profiles, and that it outperforms alternatives from the literature.
翻译:我们考虑的是顺序建议的问题, 每一步, 代理商都会向来自规模大得多的商品目录的用户建议一些美元的特殊项目。 用户对建议有未知的偏好, 代理商会采取一系列行动, 优化( 以最小化方式) 某些与用户相关的成本, 并在加强学习的帮助下 。 一个板的可能项目组合是$\binom{K ⁇ N}$, 是一个巨大的数字转换方法。 我们证明, 日期- MDP 实际上可以使用每个州仅1K美元的项目相关 $Q 函数来解构, 以更紧凑、更高效的方式描述问题。 基于这一点, 我们提出一个新的无型SASAA 和Q学习算法, 以每步执行平行的 $ 。 我们称之为 $ textt{ latfret}, 也就是免费的转换方法。 我们用数字来显示它与任意用户剖面图的精确最佳化方法相近。