Many open-domain dialogue systems rely on multiple response generators, any of which can contribute a response to the dialogue in a particular context. The ability to compare potential responses and select the best one therefore plays an important role in ensuring that a dialogue system is coherent and engaging. Dialogue coherence goes beyond simply remaining on topic: some trivia may be on topic and engaging when mentioned out of the blue, yet still not be coherent and grounded in the context of the conversation. We carry out experiments on response selection in the Athena system, an Alexa Prize SocialBot that has dedicated content and multiple topic-specific response generators for a large number of topics. First, we collect a corpus of Athena conversations with live human traffic, where potential responses from all enabled response generators are logged and subsequently annotated for response quality. We compare several off-the-shelf response ranking methods for open-domain dialogue to Athena-Heuristic, a heuristic response ranker that was field-tested in Athena during the third Alexa Prize competition. We also compare these to a transformer-based response ranker we call Athena-RR, which we train on our Athena conversations. Athena-RR uses both the conversational context and the dialogue state to rank the potential responses. We find that Athena-RR, with a Recall@1 of 70.79%, outperforms Athena-Heuristic and all of the off-the-shelf rankers by a large margin. We then conduct a live A/B study comparing Athena-Heuristic to Athena-RR in 6,358 conversations with Alexa users. We show that Athena-RR leads to significantly longer conversations that receive significantly higher user ratings than the heuristic rule-based ranker.
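
For readers unfamiliar with the metric, Recall@1 can be read as the fraction of dialogue turns in which the top-ranked candidate is one that annotators judged acceptable. The following is a minimal sketch under that reading, assuming binary quality labels per candidate; the abstract does not specify the annotation scheme, so the data layout, the function names `recall_at_k` and `mean_recall_at_k`, and the toy scores are all hypothetical and not taken from Athena.

```python
from typing import Dict, List, Sequence


def recall_at_k(scores: Sequence[float], labels: Sequence[int], k: int = 1) -> float:
    """Return 1.0 if any of the k highest-scored candidates is labeled 1, else 0.0."""
    # Candidate indices sorted from highest to lowest ranker score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1.0 if any(labels[i] == 1 for i in ranked[:k]) else 0.0


def mean_recall_at_k(turns: List[Dict[str, Sequence]], k: int = 1) -> float:
    """Average the per-turn hit indicator over all annotated turns."""
    hits = [recall_at_k(t["scores"], t["labels"], k) for t in turns]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical toy data: one entry per dialogue turn, one score and one
# binary label (1 = acceptable response) per candidate response generator.
turns = [
    {"scores": [0.9, 0.2, 0.4], "labels": [1, 0, 0]},  # top candidate is acceptable
    {"scores": [0.1, 0.8, 0.3], "labels": [1, 0, 0]},  # top candidate is not
    {"scores": [0.3, 0.5, 0.7], "labels": [0, 1, 1]},  # top candidate is acceptable
    {"scores": [0.6, 0.5, 0.1], "labels": [0, 1, 0]},  # acceptable one is ranked second
]

print(mean_recall_at_k(turns, k=1))
print(mean_recall_at_k(turns, k=2))
```

Under this definition a ranker is credited only when its first choice is acceptable, which matches the deployment setting where a single response is spoken to the user.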