Personalisation of language models for dialogue sensitises them to the speaking patterns of people with specific characteristics and/or in specific environments. However, rich character annotations are difficult to come by and difficult to leverage successfully. In this work, we release and describe a novel set of manual annotations for 863 speakers from the popular Cornell Movie Dialog Corpus, including features such as characteristic quotes and character descriptions, together with six automatically extracted metadata features for over 95% of the featured films. We perform extensive experiments on two corpora and show that such annotations can be used effectively to personalise language models, reducing perplexity by up to 8.5%. Our method can even be applied zero-shot to speakers for whom no prior training data is available, by relying on combinations of characters' demographic characteristics. Since collecting such metadata is costly, we also contribute a cost-benefit analysis that highlights which annotations were most cost-effective relative to the reduction in perplexity they yield.
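To make the headline metric concrete, the following is a minimal sketch of how perplexity and a relative perplexity reduction can be computed from per-token log-probabilities. The function names and the example numbers are illustrative assumptions, not values or code from this work, and the relative-reduction formula is the standard one rather than a confirmed detail of the paper's evaluation.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    Defined as exp of the negative mean log-probability.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def relative_reduction(baseline_ppl, personalised_ppl):
    """Fractional perplexity reduction relative to the baseline model."""
    return (baseline_ppl - personalised_ppl) / baseline_ppl


# Hypothetical example: a model assigning probability 0.25 to each of 4 tokens.
uniform_ppl = perplexity([math.log(0.25)] * 4)

# Hypothetical baseline and personalised perplexities (not results from the paper).
reduction = relative_reduction(40.0, 36.6)
print(f"perplexity: {uniform_ppl:.2f}, relative reduction: {reduction:.1%}")
```

Under this definition, a reported reduction of 8.5% means the personalised model's perplexity is 91.5% of the baseline's on the same evaluation text.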