Selecting a suitable training dataset is crucial for both general-domain (e.g., GPT-3) and domain-specific (e.g., Codex) language models (LMs). We formalize this data selection problem as selecting a subset of a large raw unlabeled dataset to match a desired target distribution, given some unlabeled target samples. Due to the large scale and dimensionality of the raw text data, existing methods use simple heuristics to select data that are similar to a high-quality reference corpus (e.g., Wikipedia), or leverage experts to manually curate data. Instead, we extend the classic importance resampling approach used in low dimensions to LM data selection. Crucially, we work in a reduced feature space to make importance weight estimation tractable over the space of text. To determine an appropriate feature space, we first show that KL reduction, a data metric measuring the proximity between the selected data and the target in a feature space, correlates strongly with average accuracy on 8 downstream tasks (r = 0.89) when computed with simple n-gram features. From this observation, we present Data Selection with Importance Resampling (DSIR), an efficient and scalable algorithm that estimates importance weights in a reduced feature space (e.g., n-gram features in our instantiation) and selects data via importance resampling according to these weights. When training general-domain models (target: Wikipedia + books), DSIR improves over random selection and heuristic filtering baselines by 2--2.5% on the GLUE benchmark. When performing continued pretraining towards a specific domain, DSIR performs comparably to expert-curated data across 8 target distributions.
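To make the described procedure concrete, below is a minimal sketch of DSIR-style selection under stated assumptions: hashed unigram/bigram counts as the reduced feature space, bag-of-n-grams estimates of the target and raw feature distributions, and the Gumbel top-k trick for resampling without replacement. All names (`NUM_BUCKETS`, `featurize`, `dsir_select`) are illustrative placeholders, not the authors' reference implementation.

```python
import zlib
import numpy as np

NUM_BUCKETS = 10_000  # number of hash buckets for n-gram features (assumed)

def featurize(text: str) -> np.ndarray:
    """Hash unigrams and bigrams of whitespace-tokenized text into buckets."""
    counts = np.zeros(NUM_BUCKETS)
    tokens = text.lower().split()
    ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    for g in ngrams:
        counts[zlib.crc32(g.encode()) % NUM_BUCKETS] += 1
    return counts

def fit_ngram_model(texts) -> np.ndarray:
    """Estimate a smoothed categorical distribution over hash buckets."""
    counts = np.ones(NUM_BUCKETS)  # add-one smoothing
    for t in texts:
        counts += featurize(t)
    return counts / counts.sum()

def dsir_select(raw_texts, target_texts, k: int, seed: int = 0):
    """Select k raw examples whose n-gram features match the target."""
    p_target = fit_ngram_model(target_texts)  # estimated target distribution
    p_raw = fit_ngram_model(raw_texts)        # estimated raw distribution
    log_ratio = np.log(p_target) - np.log(p_raw)
    # Log importance weight of each raw example under the bag-of-n-grams model.
    log_w = np.array([featurize(t) @ log_ratio for t in raw_texts])
    # Gumbel top-k: sample k indices without replacement, with probability
    # proportional to exp(log_w), by perturbing log-weights with Gumbel noise.
    rng = np.random.default_rng(seed)
    gumbel = rng.gumbel(size=len(raw_texts))
    chosen = np.argsort(-(log_w + gumbel))[:k]
    return [raw_texts[i] for i in chosen]
```

Resampling (rather than simply taking the top-k weighted examples) keeps the selected set distributed like the target instead of concentrating on a few extreme examples; the Gumbel perturbation is one standard way to implement that sampling without replacement in a single pass.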