Task-oriented dialog systems enable users to accomplish tasks using natural language. State-of-the-art systems respond to users in the same way regardless of their personalities, even though personalizing dialogs can lead to higher adoption and a better user experience. Building personalized dialog systems is an important yet challenging endeavor, and only a handful of works have taken on the challenge. Most existing works rely on supervised learning approaches and require labeled training data that is laborious and expensive to obtain for each user profile; collecting and labeling data for every profile is virtually impossible. In this work, we propose a novel framework, P-ToD, that personalizes task-oriented dialog systems and can adapt to a wide range of user profiles in an unsupervised fashion using a zero-shot generalizable reward function. P-ToD uses a pre-trained GPT-2 as its backbone model and works in three phases. Phase one performs task-specific training. Phase two carries out unsupervised personalization by leveraging the proximal policy optimization (PPO) algorithm, which performs policy-gradient updates guided by the zero-shot generalizable reward function. Our novel reward function can quantify the quality of generated responses even for unseen profiles. The optional final phase fine-tunes the personalized model using a few labeled training examples. We conduct extensive experimental analysis on the personalized bAbI dialog benchmark across five tasks and up to 180 diverse user profiles. The experimental results demonstrate that P-ToD, even with access to zero labeled examples, outperforms state-of-the-art supervised personalization models and achieves competitive performance on BLEU and ROUGE metrics when compared to a strong fully supervised GPT-2 baseline.
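To make the phase-two idea concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of reward-guided personalization of a GPT-2 policy: a response is sampled conditioned on a user profile, scored by a stand-in reward function, and the model is updated with a simplified REINFORCE-style step (full PPO would additionally clip the policy ratio against a reference model). The profile string, prompt format, and `reward_fn` are hypothetical placeholders, not the zero-shot generalizable reward used in P-ToD.

```python
# Illustrative sketch of unsupervised, reward-guided personalization of GPT-2.
# All prompts, profile strings, and the reward function are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
policy = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-5)

def reward_fn(profile: str, response: str) -> float:
    # Stand-in reward: lexical overlap between profile and response.
    # P-ToD instead uses a learned, zero-shot generalizable reward function.
    overlap = len(set(profile.lower().split()) & set(response.lower().split()))
    return float(overlap)

profile = "young female vegetarian"  # hypothetical user profile
context = f"<profile> {profile} <user> book me a table for two <system>"
inputs = tokenizer(context, return_tensors="pt")

# Sample a candidate response from the current policy.
generated = policy.generate(**inputs, do_sample=True, max_new_tokens=20,
                            pad_token_id=tokenizer.eos_token_id)
response_ids = generated[:, inputs["input_ids"].shape[1]:]
response = tokenizer.decode(response_ids[0], skip_special_tokens=True)

# Reward-weighted negative log-likelihood of the sampled response tokens
# (prompt positions are masked out with -100).
reward = reward_fn(profile, response)
labels = torch.cat([torch.full_like(inputs["input_ids"], -100), response_ids], dim=1)
loss = reward * policy(generated, labels=labels).loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In practice, libraries such as Hugging Face `trl` provide proper PPO training loops (with KL control against a frozen reference model); the single gradient step above is only meant to show where a zero-shot reward function would plug in.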