Conversational recommender systems (CRS) explicitly solicit users' preferences to improve recommendations on the fly. Most existing CRS solutions rely on a single policy, trained by reinforcement learning over a population of users. For users new to the system, however, such a global policy fails to serve them well, i.e., the cold-start challenge. In this paper, we study CRS policy learning for cold-start users via meta reinforcement learning. We propose to learn a meta policy and adapt it to new users with only a few trials of conversational recommendations. To facilitate fast policy adaptation, we design three synergistic components. First, we design a meta-exploration policy dedicated to identifying user preferences via a few exploratory conversations, which accelerates personalized policy adaptation from the meta policy. Second, we adapt the item recommendation module to each user to maximize recommendation quality, based on the conversation states collected during conversations. Third, we propose a Transformer-based state encoder as the backbone connecting the two preceding components; it provides comprehensive state representations by modeling the complicated relations between positive and negative feedback within the conversation. Extensive experiments on three datasets demonstrate the advantage of our solution in serving new users compared with a rich set of state-of-the-art CRS solutions.
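The "learn a meta policy, then adapt it to a new user in a few trials" idea can be sketched as a first-order meta-learning loop. The sketch below is a Reptile-style stand-in, not the paper's actual algorithm: a shared parameter vector is nudged toward each user's few-shot-adapted parameters, so a cold-start user needs only a few gradient steps on their own interactions. The logistic "policy" over recommendation scores and all function names are illustrative assumptions.

```python
import numpy as np

def loss_and_grad(theta, X, y):
    # Logistic loss for a linear policy scoring "accepted recommendation" (y=1).
    logits = X @ theta
    p = 1.0 / (1.0 + np.exp(-logits))
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def adapt(theta, X, y, lr=0.5, steps=3):
    # Inner loop: a few gradient steps personalize the meta policy to one user,
    # using only that user's small set of conversational interactions (X, y).
    for _ in range(steps):
        _, g = loss_and_grad(theta, X, y)
        theta = theta - lr * g
    return theta

def meta_train(users, meta_lr=0.1, epochs=50):
    # Outer loop (Reptile-style first-order meta-update): move the meta
    # parameters toward each user's adapted parameters, yielding an
    # initialization from which any user is a few gradient steps away.
    theta = np.zeros(users[0][0].shape[1])
    for _ in range(epochs):
        for X, y in users:
            adapted = adapt(theta, X, y)
            theta = theta + meta_lr * (adapted - theta)
    return theta
```

In this toy setup, the meta policy trained on a population of users gives a warm initialization, and `adapt` plays the role of the fast per-user policy adaptation described above.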