How might we design Reinforcement Learning (RL)-based recommenders that align user trajectories with the underlying user satisfaction? Three research questions are key: (1) measuring user satisfaction, (2) combating the sparsity of satisfaction signals, and (3) adapting the training of the recommender agent to maximize satisfaction. For measurement, it has been found that surveys explicitly asking users to rate their experience with consumed items can provide valuable orthogonal information to the engagement/interaction data, acting as a proxy for the underlying user satisfaction. For sparsity, i.e., only being able to observe how satisfied users are with a tiny fraction of user-item interactions, imputation models can be useful in predicting satisfaction levels for all items users have consumed. For learning satisfying recommender policies, we postulate that reward shaping in RL recommender agents is powerful for driving satisfying user experiences. Putting everything together, we propose to jointly learn a policy network and a satisfaction imputation network: the imputation network learns which actions are satisfying to the user, while the policy network, built on top of REINFORCE, decides which items to recommend, with the reward utilizing the imputed satisfaction. We use both offline analysis and live experiments on an industrial large-scale recommendation platform to demonstrate the promise of our approach for satisfying user experiences.
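As a rough illustration of the reward-shaping idea described above, the following PyTorch sketch pairs a hypothetical satisfaction imputation network with a REINFORCE policy whose reward blends observed engagement with imputed satisfaction. This is a minimal sketch under assumed settings, not the paper's implementation: the feature dimensions, network shapes, mixing weight `ALPHA`, and the helper `reinforce_step` are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of REINFORCE with a reward
# shaped by an imputation network. All sizes and ALPHA are assumptions.
import torch
import torch.nn as nn

STATE_DIM, NUM_ITEMS, HIDDEN = 32, 100, 64
ALPHA = 0.5  # assumed weight blending engagement with (imputed) satisfaction


class ImputationNet(nn.Module):
    """Predicts a satisfaction score in [0, 1] for a (user state, item) pair.

    In practice this would be trained on the sparse survey responses and then
    used to fill in satisfaction for unlabeled user-item interactions.
    """
    def __init__(self):
        super().__init__()
        self.item_emb = nn.Embedding(NUM_ITEMS, HIDDEN)
        self.mlp = nn.Sequential(
            nn.Linear(STATE_DIM + HIDDEN, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1), nn.Sigmoid())

    def forward(self, state, item):
        x = torch.cat([state, self.item_emb(item)], dim=-1)
        return self.mlp(x).squeeze(-1)


class PolicyNet(nn.Module):
    """Softmax policy over the item corpus, trained with REINFORCE."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, NUM_ITEMS))

    def forward(self, state):
        return torch.log_softmax(self.mlp(state), dim=-1)


imputer, policy = ImputationNet(), PolicyNet()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)


def reinforce_step(state, item, engagement, satisfaction_label=None):
    """One REINFORCE update on logged interactions.

    When no explicit satisfaction label (survey response) is available,
    the imputation network supplies a predicted satisfaction score.
    """
    with torch.no_grad():
        satisfaction = (satisfaction_label if satisfaction_label is not None
                        else imputer(state, item))
    # Shaped reward: observed engagement mixed with (imputed) satisfaction.
    reward = (1 - ALPHA) * engagement + ALPHA * satisfaction
    log_prob = policy(state).gather(-1, item.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * log_prob).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Example update on a toy batch of logged interactions (no survey labels).
batch_state = torch.randn(8, STATE_DIM)
batch_item = torch.randint(0, NUM_ITEMS, (8,))
batch_engagement = torch.rand(8)
reinforce_step(batch_state, batch_item, batch_engagement)
```

In this sketch the imputation and policy networks are kept separate for clarity; a joint training loop would alternate (or interleave) updates to the imputation network on labeled interactions and REINFORCE updates on the shaped reward.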