Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here, we introduce SemDeDup, a method that leverages embeddings from pre-trained models to identify and remove semantic duplicates: data pairs that are semantically similar but not exactly identical. Removing semantic duplicates preserves performance and speeds up learning. Analyzing a subset of LAION, we show that SemDeDup can remove 50% of the data with minimal performance loss, effectively halving training time. Moreover, performance also improves out of distribution. Analyzing language models trained on C4, a partially curated dataset, we show that SemDeDup improves over prior approaches while providing efficiency gains. SemDeDup provides an example of how simple ways of leveraging quality embeddings can make models learn faster with less data.
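To make the idea concrete, below is a minimal sketch of embedding-based semantic deduplication, assuming embeddings have already been computed with a pre-trained encoder. The cluster count, the duplicate threshold `eps`, and the greedy keep-one-per-pair rule are illustrative assumptions for this sketch, not the paper's exact settings.

```python
# Hedged sketch: semantic deduplication over precomputed embeddings.
# Assumptions (not from the paper): KMeans clustering via scikit-learn,
# cosine-similarity threshold 1 - eps, and keeping the earlier member of
# each duplicate pair.
import numpy as np
from sklearn.cluster import KMeans


def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 20, eps: float = 0.05) -> np.ndarray:
    """Return indices of examples kept after removing semantic duplicates.

    embeddings: (n_examples, dim) array from a pre-trained model.
    eps: pairs with cosine similarity above 1 - eps are treated as duplicates.
    """
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Cluster first so pairwise comparisons stay within clusters
    # instead of over the whole dataset.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(normed)

    keep = np.ones(len(normed), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        sims = normed[idx] @ normed[idx].T  # within-cluster cosine similarities

        # Greedily drop the later member of any pair above the threshold.
        for i in range(len(idx)):
            if not keep[idx[i]]:
                continue
            for j in np.where(sims[i] > 1.0 - eps)[0]:
                if j > i:
                    keep[idx[j]] = False
    return np.where(keep)[0]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 128))
    emb[500:] = emb[:500] + 0.01 * rng.normal(size=(500, 128))  # inject near-duplicates
    kept = semantic_dedup(emb)
    print(f"kept {len(kept)} of {len(emb)} examples")
```

On synthetic data like the example above, the injected near-duplicates fall above the similarity threshold and roughly half the set is removed; on real data, the threshold trades off how aggressively semantically redundant pairs are pruned.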