Improving user retention with reinforcement learning~(RL) has attracted increasing attention due to its significant role in boosting user engagement. However, training an RL policy from scratch inevitably hurts the user experience, because the policy must be learned through trial-and-error exploration. Offline methods, which aim to optimize the policy without online interaction, instead suffer from notorious instability in value estimation or unbounded variance in counterfactual policy evaluation. To this end, we propose optimizing user retention with the Decision Transformer~(DT), which sidesteps these offline difficulties by recasting RL as an autoregressive problem. However, deploying DT in recommendation is non-trivial due to the following challenges: (1) deficiency in modeling numerical reward values; (2) the data discrepancy between policy learning and recommendation generation; (3) unreliable offline performance evaluation. In this work, we therefore contribute a series of strategies to tackle these issues. We first design an efficient reward prompt, formed by a weighted aggregation of meta embeddings, to obtain an informative reward embedding. We then introduce a weighted contrastive learning method to bridge the discrepancy between training and inference. Furthermore, we design two robust offline metrics for measuring user retention. Finally, significant improvements on benchmark datasets demonstrate the superiority of the proposed method.
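To make the reward-prompt idea concrete, the following minimal Python (PyTorch) sketch shows one plausible way to embed a scalar return through a weighted aggregation of learnable meta embeddings, so that nearby reward values share parameters rather than each value owning its own vector. The class name `RewardPromptEmbedding`, the anchor-based distance weighting, and all hyperparameters are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn


class RewardPromptEmbedding(nn.Module):
    """Illustrative sketch: embed a scalar return by softly mixing K
    learnable meta embeddings, with mixing weights derived from the
    distance between the observed reward and K fixed anchor values."""

    def __init__(self, num_meta: int = 16, dim: int = 64, max_reward: float = 10.0):
        super().__init__()
        self.meta = nn.Embedding(num_meta, dim)  # K meta embeddings
        # Anchor values spread over the reward range (assumed known).
        self.register_buffer("anchors", torch.linspace(0.0, max_reward, num_meta))
        self.temperature = nn.Parameter(torch.tensor(1.0))

    def forward(self, reward: torch.Tensor) -> torch.Tensor:
        # reward: (batch,) scalar returns
        dist = (reward.unsqueeze(-1) - self.anchors).abs()        # (batch, K)
        weights = torch.softmax(-dist / self.temperature, dim=-1)  # soft assignment
        return weights @ self.meta.weight                          # (batch, dim)


# Usage: embeddings for two reward values, shape (2, 64).
emb = RewardPromptEmbedding()(torch.tensor([0.5, 3.0]))
```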