Recently, self-learning methods based on user satisfaction metrics and contextual bandits have shown promising results in enabling consistent improvements in conversational AI systems. However, directly optimizing such metrics through off-policy bandit learning objectives increases the risk of abrupt policy changes that disrupt the current user experience. In this study, we introduce a scalable framework that supports fine-grained, per-domain exploration targets via user-defined constraints. For example, we may want to permit fewer policy deviations in business-critical domains such as shopping, while allocating a larger exploration budget to domains such as music. We further present a novel meta-gradient learning approach that is both scalable and practical for this problem: it adaptively adjusts constraint-violation penalty terms through a meta objective that encourages balanced constraint satisfaction across domains. We conduct extensive experiments on data from a real-world conversational AI system over a set of realistic constraint benchmarks. The experimental results demonstrate that the proposed approach achieves the best balance between policy value and constraint satisfaction rate.
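To make the constrained-exploration idea concrete, the following is a minimal sketch under simplifying assumptions: a scalar policy parameter, toy stand-ins for the off-policy value estimate and per-domain deviation rate, and a plain dual-ascent update of the per-domain penalty weights in place of the meta-gradient procedure described above. All names here (`budgets`, `deviation_rate`, `policy_value`, `lam`) are hypothetical and not from the paper.

```python
import numpy as np

# Hypothetical per-domain exploration budgets: the maximum allowed fraction of
# turns on which the new policy may deviate from the deployed baseline policy.
budgets = {"shopping": 0.02, "music": 0.10}

# One penalty weight per domain, adapted during training.
lam = {d: 1.0 for d in budgets}

def deviation_rate(theta, domain):
    """Toy stand-in: fraction of actions that differ from the baseline in
    `domain`, as a function of a scalar policy parameter (sigmoid shape)."""
    bias = {"shopping": -1.0, "music": 0.5}[domain]
    return 1.0 / (1.0 + np.exp(-(theta + bias)))

def policy_value(theta):
    """Toy off-policy value estimate; more deviation yields more reward here,
    which creates the tension with the per-domain constraints."""
    return 0.3 * theta - 0.05 * theta ** 2

def penalized_objective(theta):
    """Value minus per-domain penalties on constraint violations."""
    obj = policy_value(theta)
    for d, budget in budgets.items():
        obj -= lam[d] * max(0.0, deviation_rate(theta, d) - budget)
    return obj

theta, lr_theta, lr_lam, eps = 0.0, 0.05, 0.5, 1e-4
for step in range(300):
    # Gradient ascent on the penalized objective (finite-difference gradient).
    g = (penalized_objective(theta + eps) - penalized_objective(theta - eps)) / (2 * eps)
    theta += lr_theta * g
    # Dual-ascent-style penalty update: raise lam[d] while domain d is over
    # budget, relax it otherwise, keeping pressure balanced across domains.
    for d, budget in budgets.items():
        lam[d] = max(0.0, lam[d] + lr_lam * (deviation_rate(theta, d) - budget))

print({d: round(deviation_rate(theta, d), 3) for d in budgets}, round(theta, 3))
```

In this simplified setting, the shopping domain is held near its tight 2% deviation budget while the music domain is free to use its larger budget; the paper's meta-gradient method replaces the hand-set dual-ascent step with a learned, balanced adjustment of the penalty terms.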