In this work, we evaluate various existing dialogue relevance metrics, find a strong dependency on the dataset, often with poor correlation with human scores of relevance, and propose modifications to reduce data requirements and domain sensitivity while improving correlation. Our proposed metric achieves state-of-the-art performance on the HUMOD dataset while reducing measured sensitivity to the dataset by 37%-66%. We achieve this without fine-tuning a pretrained language model, using only 3,750 unannotated human dialogues and a single negative example. Despite these limitations, we demonstrate competitive performance on four datasets from different domains. Our code, including our metric and experiments, is open-sourced.