Recent progress on neural approaches for language processing has triggered a resurgence of interest in building intelligent open-domain chatbots. However, even state-of-the-art neural chatbots cannot produce a satisfying response for every turn in a dialog. A practical solution is to generate multiple response candidates for the same context and then perform response ranking/selection to determine which candidate is best. Previous work on response selection typically trains response rankers using synthetic data formed from existing dialogs by treating the ground-truth response as the single appropriate response and constructing inappropriate responses via random selection or adversarial methods. In this work, we curated a dataset in which responses from multiple response generators produced for the same dialog context are manually annotated as appropriate (positive) or inappropriate (negative). We argue that such training data better matches the actual use case, enabling models to learn to rank responses effectively. With this new dataset, we conduct a systematic evaluation of state-of-the-art methods for response selection and demonstrate that both strategies, using multiple positive candidates and using manually verified hard negative candidates, yield significant performance improvements over adversarial training data, e.g., increases of 3% and 13% in Recall@1, respectively.
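To make the selection task and the reported metric concrete, here is a minimal, illustrative Python sketch of candidate selection and Recall@1 evaluation. All names and the data layout are hypothetical assumptions for illustration; the abstract does not specify the paper's actual models or data format.

```python
# Minimal sketch: response selection and Recall@1 (illustrative only).
from typing import List

def select_response(scores: List[float], candidates: List[str]) -> str:
    """Pick the highest-scoring candidate (response selection)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best]

def recall_at_1(ranked_labels: List[List[int]]) -> float:
    """Fraction of contexts whose top-ranked candidate is appropriate.

    ranked_labels[i] holds binary appropriateness labels for the
    candidates of context i, ordered by the ranker's score (best first).
    """
    hits = sum(labels[0] for labels in ranked_labels)
    return hits / len(ranked_labels)

# Example: across 4 dialog contexts, the top-ranked candidate is
# appropriate in 3 of them, so Recall@1 = 0.75.
print(recall_at_1([[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1]]))
```

Note that, unlike the standard synthetic setup with a single ground-truth positive, the manually annotated data described above can mark several candidates per context as appropriate, which this metric accommodates by checking only whether the top-ranked candidate is labeled positive.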