Not all data are equal. Misleading or unnecessary data can critically hinder the accuracy of Machine Learning (ML) models. When data are plentiful, misleading effects can be averaged out, but in many real-world applications data are sparse and expensive to acquire. We present a method that substantially reduces the amount of data needed to accurately train ML models, potentially opening the door to many new, limited-data applications of ML. Our method extracts the most informative data while omitting data that misleads the ML model toward inferior generalization. Specifically, the method eliminates the "double descent" phenomenon, in which more data leads to worse performance. This approach offers several key features to the ML community. Notably, the method converges naturally and removes the traditional need to divide the dataset into training, testing, and validation sets; instead, the selection metric inherently assesses testing error. This ensures that key information is never wasted on testing or validation.
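As a rough illustration of this kind of data selection, the sketch below implements a simple greedy loop that adds a candidate point to the training subset only if doing so lowers an internal error estimate on the remaining, unselected pool, and stops (converges) once no candidate helps. This is a hypothetical minimal example, not the paper's actual criterion or model: the names (greedy_select), the ridge-regression learner, and the pool-error stopping rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def greedy_select(X, y, budget, seed=0):
    """Greedily grow a training subset, keeping only points that reduce
    the model's error on the remaining (unselected) pool.
    Hypothetical illustration; not the paper's exact selection metric."""
    rng = np.random.default_rng(seed)
    n = len(X)
    selected = list(rng.choice(n, size=2, replace=False))  # small seed set
    pool = [i for i in range(n) if i not in selected]
    best_err = np.inf
    while pool and len(selected) < budget:
        # Score each candidate by the pool error after adding it.
        scores = []
        for i in pool:
            trial = selected + [i]
            model = Ridge(alpha=1e-3).fit(X[trial], y[trial])
            rest = [j for j in pool if j != i]
            err = np.mean((model.predict(X[rest]) - y[rest]) ** 2)
            scores.append((err, i))
        err, i = min(scores)
        if err >= best_err:  # no candidate improves the estimate: converge
            break
        best_err = err
        selected.append(i)
        pool.remove(i)
    return selected

# Toy usage: noisy linear data, select an informative subset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
subset = greedy_select(X, y, budget=30)
print(f"selected {len(subset)} of {len(X)} points")
```

Because the selection score is computed on data the model has not trained on, the loop needs no separate validation split, which mirrors the property claimed above under these simplified assumptions.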