Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. Both on perplexity and across several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
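To make the selection idea concrete, the sketch below illustrates the general flavor of cynical-style data selection, not the paper's implementation: candidate sentences are greedily added when they most reduce the cross-entropy of the target corpus under a smoothed unigram model of the selected set. The unigram model, add-one smoothing, and all function names here are simplifying assumptions for illustration only.

```python
import math
from collections import Counter


def cross_entropy(target_counts, total_target, selected_counts, total_selected, vocab_size):
    """Cross-entropy of the target unigram distribution under an
    add-one-smoothed unigram model of the selected data."""
    h = 0.0
    for word, c in target_counts.items():
        p_target = c / total_target
        p_model = (selected_counts.get(word, 0) + 1) / (total_selected + vocab_size)
        h -= p_target * math.log(p_model)
    return h


def cynical_select(candidates, target_sentences, budget):
    """Greedy sketch of cynical-style selection: repeatedly add the candidate
    sentence that most lowers the cross-entropy of the target corpus under a
    unigram model of the selected set, conditioning selection on the target."""
    target_counts = Counter(w for s in target_sentences for w in s.split())
    total_target = sum(target_counts.values())
    vocab = set(target_counts) | {w for s in candidates for w in s.split()}
    vocab_size = len(vocab)

    selected, selected_counts, total_selected = [], Counter(), 0
    remaining = list(candidates)

    while remaining and len(selected) < budget:
        base = cross_entropy(target_counts, total_target,
                             selected_counts, total_selected, vocab_size)
        best_idx, best_gain = None, 0.0
        for i, sent in enumerate(remaining):
            counts = Counter(sent.split())
            h = cross_entropy(target_counts, total_target,
                              selected_counts + counts,
                              total_selected + sum(counts.values()),
                              vocab_size)
            if base - h > best_gain:
                best_idx, best_gain = i, base - h
        if best_idx is None:  # no remaining sentence improves the model; stop early
            break
        chosen = remaining.pop(best_idx)
        selected.append(chosen)
        selected_counts += Counter(chosen.split())
        total_selected += len(chosen.split())
    return selected
```

In this toy formulation, the target corpus (e.g. sentences drawn from OntoNotes) fixes the reference distribution, and the candidate pool (e.g. documents from the Pile) is ranked only by how much it helps model that distribution, which is the intuition behind selecting a small, domain-representative subset rather than more data.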