Pretrained Language Models (PLMs) have achieved tremendous success in natural language understanding tasks. While different learning schemes -- fine-tuning, zero-shot, and few-shot learning -- have been widely explored and compared for languages such as English, there is comparatively little work in Chinese that fairly and comprehensively evaluates and compares these methods, which hinders cumulative progress. In this paper, we introduce the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive few-shot evaluation benchmark in Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification to machine reading comprehension. We systematically evaluate five state-of-the-art (SOTA) few-shot learning methods (PET, ADAPET, LM-BFF, P-tuning, and EFL) and compare their performance with fine-tuning and zero-shot learning on the newly constructed FewCLUE benchmark. Experimental results reveal that: 1) the effectiveness of different few-shot learning methods is sensitive to the pre-trained model to which they are applied; 2) PET and P-tuning achieve the best overall performance with RoBERTa and ERNIE, respectively. Our benchmark is used in the few-shot learning contest of NLPCC 2021. In addition, we provide a user-friendly toolkit and an online leaderboard to facilitate further progress on Chinese few-shot learning. We also report baseline performance for the different learning methods as a reference for future research.