For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around $10\times$--$500\times$ less data), outperforming the latter on $7$ and $5$ datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Our results suggest that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the incorporation of massive datasets. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.