Long-term engagement is preferred over immediate engagement in sequential recommendation because it directly affects product operational metrics such as daily active users (DAUs) and dwell time. Meanwhile, reinforcement learning (RL) is widely regarded as a promising framework for optimizing long-term engagement in sequential recommendation. However, because online interactions are expensive, it is very difficult for RL algorithms to perform state-action value estimation, exploration, and feature extraction when optimizing long-term engagement. In this paper, we propose ResAct, which seeks a policy that is close to, but better than, the online-serving policy. In this way, we can collect sufficient data near the learned policy so that state-action values can be properly estimated, and there is no need to perform online exploration. Because the policy space is huge, directly optimizing such a policy is difficult; ResAct instead first reconstructs the online behaviors and then improves upon them. Our main contributions are fourfold. First, we design a generative model which reconstructs behaviors of the online-serving policy by sampling multiple action estimators. Second, we design an effective learning paradigm to train a residual actor which outputs the residual needed to improve the reconstructed actions. Third, we facilitate feature extraction with two information-theoretic regularizers that encourage the learned features to be both expressive and concise. Fourth, we conduct extensive experiments on a real-world dataset consisting of millions of sessions, and our method significantly outperforms state-of-the-art baselines on various long-term engagement optimization tasks.
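To make the reconstruct-then-improve idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: it abstracts the paper's generative model into K deterministic action-estimator heads that reconstruct candidate behaviors of the online-serving policy, applies a residual actor to each candidate, and keeps the candidate that a learned critic scores highest. All module names, network sizes, and the value of K are illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): reconstruct-then-improve with
# K action estimators and a residual actor, scored by a learned critic.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, K = 32, 8, 4  # illustrative sizes, not from the paper

class ActionEstimator(nn.Module):
    """One of K heads reconstructing an action of the online-serving policy."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM))

    def forward(self, state):
        return self.net(state)

class ResidualActor(nn.Module):
    """Outputs a residual that nudges a reconstructed action toward higher return."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Critic(nn.Module):
    """State-action value estimate used to pick the best improved action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

estimators = nn.ModuleList(ActionEstimator() for _ in range(K))
residual_actor, critic = ResidualActor(), Critic()

def act(state):
    # 1) Reconstruct K candidate behaviors of the online-serving policy.
    candidates = [est(state) for est in estimators]
    # 2) Improve each candidate with the residual output by the residual actor.
    improved = [a + residual_actor(state, a) for a in candidates]
    # 3) Keep the candidate with the highest estimated state-action value.
    scores = torch.stack([critic(state, a) for a in improved], dim=0)
    return improved[int(scores.argmax())]

print(act(torch.randn(STATE_DIM)).shape)  # torch.Size([8])
```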