Reinforcement Learning to Optimize Lifetime Value in Cold-Start Recommendation

Recommender systems play a crucial role in modern e-commerce platforms. Because historical interactions between users and items are scarce, cold-start recommendation is a challenging problem. To alleviate the cold-start issue, most existing methods introduce content and contextual information as auxiliary information. Nevertheless, these methods assume that recommended items behave steadily over time, whereas in a typical e-commerce scenario items generally perform very differently at different stages of their life cycle. In such a situation, it is beneficial to consider the long-term return from the item perspective, which conventional methods usually ignore. Reinforcement learning (RL) naturally fits such a long-term optimization problem: the recommender can identify high-potential items and proactively allocate more user impressions to boost their growth, thereby improving the multi-period cumulative gains. Inspired by this idea, we model the process as a Partially Observable and Controllable Markov Decision Process (POC-MDP) and propose an actor-critic RL framework (RL-LTV) that incorporates item lifetime values (LTV) into the recommendation. In RL-LTV, the critic studies historical trajectories of items and predicts the future LTV of fresh items, while the actor suggests a score-based policy that maximizes the expected future LTV. Scores suggested by the actor are then combined with classical ranking scores in a dual-rank framework, so that the recommendation is balanced with the LTV consideration. On one of the largest e-commerce platforms, our method outperforms the strong live baseline with relative improvements of 8.67% and 18.03% in IPV and GMV of cold-start items, respectively.
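
To make the two ideas above concrete, the following is a minimal sketch, not the RL-LTV implementation. It assumes that an item's LTV can be illustrated as a discounted sum of per-period rewards, and that the dual-rank step is a simple linear blend of the classical ranking score and an LTV-based score. The abstract does not specify the combination rule, so the blend, the function names `discounted_ltv` and `dual_rank`, and the parameters `gamma` and `weight` are all hypothetical.

```python
from typing import List, Sequence


def discounted_ltv(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Discounted cumulative reward of one item trajectory.

    Illustrative stand-in for an item's lifetime value; the discount
    factor `gamma` is an assumption, not a value from the abstract.
    """
    value = 0.0
    # Accumulate from the last period backwards: V_t = r_t + gamma * V_{t+1}.
    for reward in reversed(rewards):
        value = reward + gamma * value
    return value


def dual_rank(
    rank_scores: Sequence[float],
    ltv_scores: Sequence[float],
    weight: float = 0.3,
) -> List[int]:
    """Order item indices by a blended score, best first.

    The linear blend is a hypothetical choice made for this sketch;
    `weight` controls how much the LTV-based score influences the order.
    """
    if len(rank_scores) != len(ltv_scores):
        raise ValueError("score lists must have the same length")
    blended = [
        (1.0 - weight) * r + weight * l
        for r, l in zip(rank_scores, ltv_scores)
    ]
    return sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)


if __name__ == "__main__":
    # Item 2 is a fresh item: low classical score, high predicted LTV score.
    classical = [0.9, 0.6, 0.5]
    ltv_based = [0.1, 0.2, 0.95]
    print(dual_rank(classical, ltv_based, weight=0.0))  # classical order only
    print(dual_rank(classical, ltv_based, weight=0.5))  # LTV-aware order
```

With `weight=0.0` the order reduces to the classical ranking; raising `weight` lets an item with a high LTV-based score move ahead of items that currently rank better, which is the kind of trade-off the dual-rank framework is described as balancing.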
