Multi-Action Dialog Policy Learning from Logged User Feedback

Multi-action dialog policies, which generate multiple atomic dialog actions per turn, are widely used in task-oriented dialog systems to provide expressive and efficient system responses. Existing policy models usually imitate action combinations from labeled multi-action dialog examples. Because such data is limited, they generalize poorly to unseen dialog flows. Reinforcement learning-based methods have been proposed to incorporate service ratings from real users and user simulators as external supervision signals, but they suffer from sparse and less credible dialog-level rewards. To cope with this problem, we explore improving multi-action dialog policy learning with explicit and implicit turn-level user feedback received for historical predictions (i.e., logged user feedback), which is cost-efficient to collect and faithful to real-world scenarios. The task is challenging because logged user feedback provides only partial label feedback, limited to the particular historical dialog actions predicted by the agent. To fully exploit such feedback, we propose BanditMatch, which addresses the task from a feedback-enhanced semi-supervised learning perspective with a hybrid objective of semi-supervised learning and bandit learning. BanditMatch integrates pseudo-labeling methods to better explore the action space by constructing full label feedback. Extensive experiments show that BanditMatch outperforms state-of-the-art methods by generating more concise and informative responses. The source code and the appendix of this paper are available at https://github.com/ShuoZhangXJTU/BanditMatch.
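To make the idea of a hybrid objective more concrete, the following is a minimal sketch of how a supervised term, a pseudo-labeling term, and a bandit term over logged feedback might be combined. It is not the formulation used in BanditMatch. Every function name, the weights `lam_u` and `lam_b`, the threshold `tau`, the independent per-action Bernoulli assumption, and the treatment of negative feedback are assumptions chosen for illustration only.

```python
import math

EPS = 1e-12  # numerical guard for log(0)

def bce(probs, targets):
    """Mean binary cross-entropy over the atomic actions of one turn."""
    total = 0.0
    for p, y in zip(probs, targets):
        total -= y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return total / len(probs)

def combo_prob(probs, actions):
    """Probability of one multi-hot action combination,
    assuming independent per-action Bernoulli outputs (an assumption)."""
    out = 1.0
    for p, a in zip(probs, actions):
        out *= p if a else 1 - p
    return out

def bandit_loss(probs, logged_actions, positive):
    """Loss for a logged turn with partial label feedback.
    Positive feedback: treat the logged combination as the target.
    Negative feedback: only push probability away from that combination,
    since the correct combination is unknown."""
    if positive:
        return bce(probs, logged_actions)
    return -math.log(1 - combo_prob(probs, logged_actions) + EPS)

def pseudo_label(probs, tau=0.9):
    """Return a multi-hot pseudo-label if every action is predicted
    with confidence >= tau, otherwise None (the turn is skipped)."""
    if all(max(p, 1 - p) >= tau for p in probs):
        return [1 if p >= 0.5 else 0 for p in probs]
    return None

def hybrid_loss(labeled, logged, unlabeled, lam_u=1.0, lam_b=1.0, tau=0.9):
    """labeled: list of (probs, targets); logged: list of
    (probs, logged_actions, positive); unlabeled: list of probs."""
    sup = sum(bce(p, y) for p, y in labeled) / max(len(labeled), 1)
    ban = sum(bandit_loss(p, a, r) for p, a, r in logged) / max(len(logged), 1)
    # Here the pseudo-label is scored against the same prediction, which acts
    # as a confidence-sharpening term; a real pipeline would typically score
    # it against the prediction for a perturbed version of the input.
    kept = [(p, pseudo_label(p, tau)) for p in unlabeled]
    kept = [(p, y) for p, y in kept if y is not None]
    unsup = sum(bce(p, y) for p, y in kept) / max(len(kept), 1)
    return sup + lam_u * unsup + lam_b * ban
```

In this sketch, a logged turn with negative feedback contributes a gradient only against the specific combination that was shown to the user, which reflects the partial nature of the feedback; the pseudo-labeling term is one simple way to supply a full label for turns that have none.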
