Pretrained Language Models (PLMs) have achieved tremendous success in natural language understanding tasks. While different learning schemes -- fine-tuning, zero-shot, and few-shot learning -- have been widely explored and compared for languages such as English, there is comparatively little work in Chinese to fairly and comprehensively evaluate and compare these methods. This work first introduces the Chinese Few-shot Learning Evaluation Benchmark (FewCLUE), the first comprehensive few-shot evaluation benchmark in Chinese. It includes nine tasks, ranging from single-sentence and sentence-pair classification to machine reading comprehension. Given the high variance of few-shot learning performance, we provide multiple training/validation splits to facilitate a more accurate and stable evaluation of few-shot models. An unlabeled training set with up to 20,000 additional samples per task is also provided, allowing researchers to explore better ways of using unlabeled data. Next, we implement a set of state-of-the-art (SOTA) few-shot learning methods (including PET, ADAPET, LM-BFF, P-tuning, and EFL) and compare their performance with fine-tuning and zero-shot learning on the newly constructed FewCLUE benchmark. Our results show that: 1) all five few-shot learning methods outperform fine-tuning and zero-shot learning; 2) among the five methods, PET is the best performing; 3) few-shot learning performance is highly task-dependent. Our benchmark and code are available at https://github.com/CLUEbenchmark/FewCLUE
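To make the pattern-verbalizer idea behind PET (the best-performing method in the comparison above) concrete, here is a minimal sketch of cloze-style classification with a Chinese masked language model. The pattern, verbalizer, and example sentence are illustrative assumptions, not the ones used in the paper or the FewCLUE codebase.

```python
# Minimal sketch: PET-style cloze classification with a Chinese masked LM.
# The pattern "总之很[MASK]。" and the verbalizer below are hypothetical examples.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-chinese"  # any Chinese masked LM would work here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Verbalizer: map each label to a single Chinese token the LM can predict.
verbalizer = {"positive": "好", "negative": "差"}

def classify(sentence: str) -> str:
    # Pattern: turn the input into a cloze question with one [MASK] slot,
    # roughly "<sentence>. Overall it was [MASK]."
    prompt = f"{sentence}。总之很{tokenizer.mask_token}。"
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    # Score only the verbalizer tokens at the mask position; pick the best label.
    scores = {label: logits[tokenizer.convert_tokens_to_ids(tok)].item()
              for label, tok in verbalizer.items()}
    return max(scores, key=scores.get)

print(classify("这家餐厅的菜又便宜又好吃"))  # expected: "positive"
```

In full PET, this cloze scoring is not used zero-shot alone: the masked LM is also fine-tuned on the few labeled examples with a cross-entropy loss restricted to the verbalizer tokens, which is what the few-shot comparison on FewCLUE evaluates.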