Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, more educated, and more urban ZIP codes, are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and that more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.
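To make the filtering setup concrete: the quality filter in question is a linear classifier trained to distinguish anchor corpora (Wikipedia, books, newswire-like text) from raw web text, and web documents are kept or discarded based on the resulting score. The sketch below illustrates that general setup; the toy documents, feature dimensionality, `quality_score` helper, and keep-threshold are all illustrative assumptions, not details taken from this paper or the GPT-3 report.

```python
# A minimal sketch of a GPT-3-style "quality" filter: a linear classifier
# that scores web documents by how much they resemble anchor corpora such
# as Wikipedia, books, and newswire. Corpus contents, feature size, and
# threshold below are illustrative assumptions, not the original settings.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the two training pools.
anchor_docs = [
    "Photosynthesis is the process by which plants convert light into energy.",
    "The committee's report, released Tuesday, detailed the budget shortfall.",
]
web_docs = [
    "CLICK HERE for FREE prizes!!! best deals best deals best deals",
    "lol idk what this page is even about tbh",
]

# Hashed bag-of-words features, in the spirit of the hashed n-gram
# features used by such filters (the dimensionality here is arbitrary).
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(anchor_docs + web_docs)
y = [1] * len(anchor_docs) + [0] * len(web_docs)

clf = LogisticRegression(max_iter=1000).fit(X, y)

def quality_score(doc: str) -> float:
    """Probability that `doc` resembles the anchor corpora."""
    return clf.predict_proba(vectorizer.transform([doc]))[0, 1]

# Documents scoring above a chosen threshold are kept for pretraining.
KEEP_THRESHOLD = 0.5  # hypothetical cutoff
print(quality_score("Students at Lincoln High debated the new attendance policy."))
```

The paper's argument hinges on exactly this mechanism: whatever the anchors are, the classifier's score measures resemblance to them, not factuality or literary merit, so the choice of anchors encodes a language ideology.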