An intelligent dialogue system in a multi-turn setting should not only generate good-quality responses, but also responses that can lead to the long-term success of the dialogue. Although current approaches improve response quality, they overlook the training signals present in the dialogue data. We can leverage these signals to generate weakly supervised training data for learning a dialogue policy and a reward estimator, and make the policy take actions (generate responses) that foresee the direction of a successful (rewarding) conversation. We simulate dialogues in which an agent and a user (modelled, like the agent, with a supervised learning objective) interact with each other. The agent uses dynamic blocking to generate ranked, diverse responses and exploration-exploitation to select among the Top-K responses. Each simulated state-action pair is evaluated (serving as a weak annotation) by three quality modules: Semantic Relevance, Semantic Coherence and Consistent Flow. Empirical studies on two benchmarks indicate that our model significantly improves response quality and leads to successful conversations under both automatic evaluation and human judgement.
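As a rough illustration of the selection step described above, the minimal sketch below shows an epsilon-greedy choice among Top-K ranked candidate responses, where the weak annotation is an average of the three quality modules. The names (`weak_reward`, `select_response`, `epsilon`, `top_k`) and the simple averaging scheme are assumptions for illustration, not the paper's exact formulation.

```python
import random
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]  # maps (context, response) -> quality score

def weak_reward(context: str, response: str,
                relevance: Scorer, coherence: Scorer, flow: Scorer) -> float:
    """Combine the three quality modules (Semantic Relevance, Semantic
    Coherence, Consistent Flow) into one weak annotation for a simulated
    state-action pair. Averaging is an assumption for this sketch."""
    return (relevance(context, response)
            + coherence(context, response)
            + flow(context, response)) / 3.0

def select_response(context: str,
                    ranked_candidates: List[str],
                    scorers: Tuple[Scorer, Scorer, Scorer],
                    epsilon: float = 0.1,
                    top_k: int = 5) -> Tuple[str, float]:
    """Exploration-exploitation over the Top-K ranked candidates:
    exploit the best-scoring response most of the time, explore a
    random Top-K candidate with probability epsilon."""
    candidates = ranked_candidates[:top_k]
    scored = [(r, weak_reward(context, r, *scorers)) for r in candidates]
    if random.random() < epsilon:
        return random.choice(scored)          # explore
    return max(scored, key=lambda x: x[1])    # exploit
```

The selected response and its weak reward would then be stored as a (state, action, reward) tuple for training the policy and the reward estimator.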