Whereas much of the success of the current generation of neural language models has been driven by increasingly large training corpora, relatively little research has been dedicated to analyzing these massive sources of textual data. In this exploratory analysis, we delve deeper into the Common Crawl, a colossal web corpus that is extensively used for training language models. We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures. We conclude with a discussion of the potential impacts of this content on language models and call for a more mindful approach to corpus collection and analysis.