Realizing general-purpose language intelligence has been a longstanding goal of natural language processing, in which standard evaluation benchmarks play a fundamental and guiding role. We argue that for evaluating general-purpose language intelligence, the benchmark itself needs to be comprehensive and systematic. To this end, we propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features: (1) A hierarchical benchmark framework, in which datasets are selected and organized in a principled way along a language capability–task–dataset hierarchy. (2) A multi-level scoring strategy, in which model performance is reported at each level of the hierarchical framework. To facilitate CUGE, we provide a public leaderboard that can be customized to support flexible model-judging criteria. Evaluation results on representative pre-trained language models indicate ample room for improvement towards general-purpose language intelligence. CUGE is publicly available at cuge.baai.ac.cn.
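As a rough illustration of what a multi-level scoring strategy over a capability–task–dataset hierarchy might look like, the sketch below aggregates dataset-level scores upward by simple averaging at each level. This is a hypothetical example, not CUGE's actual scoring formula, and the capability, task, and dataset names are invented for illustration.

```python
# Hedged sketch: multi-level score aggregation over a
# capability -> task -> dataset hierarchy. Simple means are an
# assumption; the real benchmark may weight levels differently.
from statistics import mean

def multilevel_scores(hierarchy):
    """Aggregate dataset scores upward: each task score is the mean of
    its dataset scores, each capability score is the mean of its task
    scores, and the overall score is the mean of capability scores."""
    task_scores = {}
    capability_scores = {}
    for capability, tasks in hierarchy.items():
        for task, datasets in tasks.items():
            task_scores[(capability, task)] = mean(datasets.values())
        capability_scores[capability] = mean(
            task_scores[(capability, t)] for t in tasks
        )
    overall = mean(capability_scores.values())
    return task_scores, capability_scores, overall

# Toy hierarchy with invented names and scores.
scores = {
    "understanding": {
        "reading_comprehension": {"dataset_a": 80.0, "dataset_b": 70.0},
        "classification": {"dataset_c": 90.0},
    },
    "generation": {
        "summarization": {"dataset_d": 60.0},
    },
}
task_s, cap_s, overall = multilevel_scores(scores)
```

Reporting scores at every level of the hierarchy, rather than a single aggregate, lets a leaderboard surface where a model is strong or weak per capability and per task.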