Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing, and paraphrase identification. We complement our study with an analysis of the cost-benefit trade-off of training such models. Our experiments show that while models pretrained on more data encode more syntactic knowledge and perform better on downstream applications, they do not consistently perform better across the different syntactic phenomena and come at a higher financial and environmental cost.