While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl monotonically improves the quality of training data, we find that aggressive filtering can in fact decrease model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work motivates more detailed future analysis of how dataset filtering design choices affect downstream model performance.
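To make the notion of "filtering aggressiveness" concrete, the sketch below shows one common form of classifier-based quality filtering, in the spirit of the Pareto-threshold rule described for GPT-3's Common Crawl filtering: each document receives a proxy quality score, and a single shape parameter controls how strongly the proxy is optimized. This is a minimal illustration, not the implementation studied here; `quality_score`, `alpha`, and `pareto_filter` are hypothetical names, and the proxy classifier itself is assumed to exist.

```python
import numpy as np
from typing import Callable, Iterable, List

def pareto_filter(
    docs: Iterable[str],
    quality_score: Callable[[str], float],  # hypothetical proxy metric in [0, 1]
    alpha: float,                           # larger alpha -> more aggressive filtering
    seed: int = 0,
) -> List[str]:
    """Keep a document iff a Pareto(alpha) draw exceeds 1 minus its proxy score.

    Small alpha keeps nearly everything; large alpha concentrates the kept
    set on documents the proxy classifier prefers -- the regime in which
    over-optimizing the proxy can degrade downstream task performance.
    """
    rng = np.random.default_rng(seed)
    kept = []
    for doc in docs:
        # Stochastic soft threshold: low-scoring documents are occasionally
        # kept, high-scoring ones are kept with high probability.
        if rng.pareto(alpha) > 1.0 - quality_score(doc):
            kept.append(doc)
    return kept
```

Under this kind of rule, sweeping `alpha` upward traces out increasingly aggressive filtering of the same corpus, which is the axis along which the non-monotonic downstream behavior described above can be observed.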