Multi-task benchmarks such as GLUE and SuperGLUE have driven great progress in pretraining and transfer learning for Natural Language Processing (NLP). These benchmarks focus mostly on a range of Natural Language Understanding (NLU) tasks and leave Natural Language Generation (NLG) models largely unexamined. In this paper, we present the General Language Generation Evaluation (GLGE), a new multi-task benchmark for evaluating the generalization capabilities of NLG models across eight language generation tasks. For each task, we further design three subtasks of increasing difficulty (GLGE-Easy, GLGE-Medium, and GLGE-Hard), yielding 24 subtasks in total for a comprehensive comparison of model performance. To encourage research on pretraining and transfer learning for NLG models, we make GLGE publicly available and build a leaderboard with strong baselines including MASS, BART, and ProphetNet (the source code and datasets will be publicly available at https://github.com/microsoft/glge).