Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method which leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs that are semantically similar but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time; moreover, out-of-distribution performance improves. In addition, analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup is an example of how simple ways of leveraging quality embeddings can make models learn faster with less data.
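To make the mechanism concrete, the following is a minimal sketch (not the paper's exact implementation) of semantic deduplication with pre-trained embeddings: cluster the embeddings to keep pairwise comparison tractable, then, within each cluster, drop one member of every pair whose cosine similarity exceeds a threshold. The function name, cluster count, and 0.95 threshold are illustrative assumptions; in practice the embeddings would come from a pre-trained encoder (e.g., an image or text model) rather than the random placeholders used here.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings, n_clusters=10, threshold=0.95, seed=0):
    """Return indices of examples kept after removing semantic near-duplicates.

    embeddings: (n, d) array of embeddings from a pre-trained model.
    Clustering first keeps the pairwise comparison cost manageable:
    similarities are only computed within each cluster.
    """
    # Normalize rows so dot products equal cosine similarities.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)

    keep = np.ones(len(embeddings), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < 2:
            continue
        # Cosine similarities between all pairs within the cluster.
        sims = embeddings[idx] @ embeddings[idx].T
        # For each pair (i, j) with i < j above the threshold, drop example j.
        upper = np.triu(sims, k=1)
        _, dup_cols = np.where(upper > threshold)
        keep[idx[dup_cols]] = False
    return np.where(keep)[0]

# Usage with random embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
kept = semantic_dedup(emb, n_clusters=20, threshold=0.95)
print(f"kept {len(kept)} of {len(emb)} examples")
```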