Rapid progress in deep learning has driven a growing need for suitable training data. The popularity of large datasets, sometimes called "big data", has diverted attention from assessing their quality. Training on large datasets often demands excessive system resources and an infeasible amount of time. Moreover, supervised learning is not yet fully automated: larger datasets require more time for manually labeling samples. We propose a method for curating smaller datasets that preserve comparable out-of-distribution model accuracy: after an initial training session, samples are classified by how difficult they are for the model to learn, and the curated subset is drawn with an appropriate distribution across these difficulty levels.
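As a rough illustration of the idea, the sketch below scores each training sample by the per-sample loss of an initially trained model (one plausible proxy for learning difficulty), bins samples from easy to hard, and draws a fixed budget of samples spread across the bins. This is a minimal, assumption-laden sketch, not the paper's actual procedure: the helper names (`score_samples`, `curate`), the choice of cross-entropy loss as the difficulty measure, the bin count, and the uniform allocation per bin are all illustrative.

```python
# Hypothetical sketch of difficulty-based dataset curation.
# Assumes: a model already trained for an initial session, per-sample
# loss as the difficulty proxy, and an even draw across difficulty bins.
import numpy as np
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, Subset


def score_samples(model: torch.nn.Module, dataset: Dataset) -> np.ndarray:
    """Per-sample cross-entropy loss under the initially trained model."""
    model.eval()
    losses = []
    loader = DataLoader(dataset, batch_size=256, shuffle=False)
    with torch.no_grad():
        for x, y in loader:
            logits = model(x)
            # reduction="none" keeps one loss value per sample
            losses.append(F.cross_entropy(logits, y, reduction="none"))
    return torch.cat(losses).cpu().numpy()


def curate(dataset: Dataset, scores: np.ndarray,
           budget: int, n_bins: int = 10) -> Subset:
    """Draw `budget` samples spread evenly over difficulty bins."""
    # Sort indices from easiest (lowest loss) to hardest, then split
    # into contiguous difficulty bins.
    bins = np.array_split(np.argsort(scores), n_bins)
    per_bin = budget // n_bins
    rng = np.random.default_rng(0)
    chosen = np.concatenate([
        rng.choice(b, size=min(per_bin, len(b)), replace=False)
        for b in bins
    ])
    return Subset(dataset, chosen.tolist())
```

A usage pattern under these assumptions would be `small_set = curate(train_set, score_samples(model, train_set), budget=10_000)`, after which the model is retrained on `small_set` alone. Allocating samples evenly across bins is only one choice of "appropriate distribution"; skewing the draw toward harder or easier bins is an equally valid variant of the same scheme.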