Linguistic pragmatics holds that a conversation's underlying speech acts can constrain the type of response that is appropriate at each turn. Neural dialogue agents, however, struggle to generate diverse responses. Dialogue diversity is currently assessed with automatic metrics, but these metrics are not informed by the underlying speech acts. To remedy this, we propose the notion of Pragmatically Appropriate Diversity, defined as the extent to which a conversation creates and constrains the creation of multiple diverse responses. Using a human-created multi-response dataset, we find significant support for the hypothesis that speech acts provide a signal for the diversity of the set of next responses. Building on this result, we propose a new human evaluation task in which creative writers predict the extent to which conversations inspire the creation of multiple diverse responses. Our studies find that writers' judgments align with the Pragmatically Appropriate Diversity of conversations. Our work suggests that expectations for diversity metric scores should vary depending on the speech act.