Chatbots are expected to be knowledgeable across multiple domains, e.g. for daily chit-chat, exchange of information, and grounding in emotional situations. To effectively measure the quality of such conversational agents, a model-based automatic dialogue evaluation metric (ADEM) is expected to perform well across multiple domains. Despite significant progress, an ADEM that works well in one domain does not necessarily generalize to another. This calls for a dedicated network architecture for domain generalization. To tackle the multi-domain dialogue evaluation task, we propose a Panel of Experts (PoE), a multitask network that consists of a shared transformer encoder and a collection of lightweight adapters. The shared encoder captures the general knowledge of dialogues across domains, while each adapter specializes in one specific domain and serves as a domain expert. To validate the idea, we construct a high-quality multi-domain dialogue dataset leveraging data augmentation and pseudo-labeling. The PoE network is comprehensively assessed on 16 dialogue evaluation datasets spanning a wide range of dialogue domains. It achieves state-of-the-art performance in terms of mean Spearman correlation over all the evaluation datasets. It exhibits better zero-shot generalization than existing state-of-the-art ADEMs and the ability to easily adapt to new domains with few-shot transfer learning.
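The headline number above, mean Spearman correlation over all evaluation datasets, can be illustrated with a minimal sketch. It assumes that each dataset pairs one metric score with one human rating per response, that Spearman correlation is computed per dataset, and that the per-dataset values are then averaged with equal weight. The function names and the toy scores below are hypothetical and are not taken from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(datasets):
    """Unweighted mean of per-dataset Spearman correlations (an assumption)."""
    return mean(spearman(metric, human) for metric, human in datasets.values())


# Hypothetical toy data: (metric scores, human ratings) for each dataset.
toy = {
    "chitchat": ([0.9, 0.2, 0.6, 0.4], [5, 1, 4, 2]),
    "knowledge": ([0.1, 0.8, 0.5, 0.7], [2, 3, 1, 4]),
}

if __name__ == "__main__":
    print(round(mean_spearman(toy), 3))
```

Averaging per-dataset correlations, rather than pooling all responses into one list, keeps a large dataset from dominating the summary; whether the paper weights datasets equally is an assumption of this sketch.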